Examine Assay Matrix Distributions

When employing TReNA, we are applying a solver to an assay matrix to infer relationships between transcription factors and target genes. Both the solver type and the assay matrix are mutable entities; there are currently 6 different solvers built into TReNA. The assay matrix can also be transformed in various ways. In this notebook, we will examine these two variables in an attempt to answer the following questions:

  1. How do the different solvers compare to one another on the same data?
  2. What is the effect of transforming the assay matrix in different ways?

To this end, we will be using the ampAD data of 154 transcription factors and 278 samples, with MEF2C as the target gene. Our three solvers will be LASSO, Random Forest, and Bayes Spike. The 4 different data transformations we will apply here are: as-is (unchanged), log2-transformed, asinh-transformed, and voom-transformed.

Details on the Data

For this exercise, we will be using RNAseq data gathered for Alzheimer's disease (AD) and related diseases Original Paper Here. These expression data are from samples taken during brain autopsies from 4 groups of subjects:

  1. Normal/control group, with no discernable cognitive impairment and no diagnosis of any neurodegnerative disease
  2. AD group, with cognitive impairment and a diagnosis of AD
  3. Progressive supranuclear palsy (PSP) group, with cognitive impairment and a diagnosis of PSP
  4. Pathelogical aging, with the presence of amyloid plaques associated with a diseased state, but with no discernible cognitive impairment.

MEF2C is a gene identified through genome-wide association studies to be associated with protection from AD. To better understand which transcription factors are associated with MEF2C, and thus may be associated with AD, we will use TReNA to identify meaningful relationships.

Transformation Characteristics

Let us first get a handle on what our data look like under the 4 different transformations.

  1. The "as-is" data are fairly straightforward; we have RNAseq data in units of RPKM with expression levels concentrated near 0, but the range of the data is quite large. This skewed dataset may not be ideal (though we don't exactly have a standard to compare to), so one way to adjust that skewed nature is to do a log2 transformation.

  2. For each value in the matrix, the log2 transformation subtracts the minimum expression value in the matrix, adds a very small number (0.001) to prevent any attempt at taking log2(0), and then takes log2() of the value. As a result, the log2-transformed data have a much smaller range and appear much more Gaussian than the as-is data. They also cross over 0 into the negative range, a characteristic that differs from the original data.

  3. The third data transformation is using asinh, the hyperbolic arcsine function. The actual math here is to simply take asinh() of every value in the matrix; the resulting distribution is scaled down, much like the log2-transformed data, but all values are positive.

  4. The fourth and final transformation is VOOM transformation, a technique that estimates the mean-variance relationship in the data and uses it to weigh each observation. The transformation is performed using the voom() function in the limma package from Biocondutor, which is documented here. The manuscript for the VOOM transformation can be found here.

For a better understanding of what these distributions look like, we can look at histograms of all 3:

In [1]:
# Load the assay matrix (mtx.sub) and transform it 3 ways
load("../extdata/ampAD.154genes.mef2cTFs.278samples.RData")

mtx.tmp <- mtx.sub - min(mtx.sub) + 0.001
mtx.log2 <- log2(mtx.tmp)

mtx.asinh <- asinh(mtx.sub)

suppressMessages(library(limma))
mtx.voom <- voom(mtx.sub)$E

# Plot them all
par(mfrow = c(4,1))
par(family = "sans")
hist(mtx.sub, main = "As-is Data")
hist(mtx.log2, main = "Log2-Transformed Data")
hist(mtx.asinh, main = "Asinh-Transformed Data")
hist(mtx.voom, main = "Voom-Transformed Data")

The data distributions are quite dissimilar, so it follows that we might expect different results when using them, even if we're using the same solver. To explicitly quantify these distributions, we can also take Tukey's five number summary (min,1st quartile, median, 3rd quartile, max) of each one:

In [2]:
cat("As-is :",fivenum(mtx.sub),"\n")
cat(" Log2 :",fivenum(mtx.log2),"\n")
cat("Asinh :",fivenum(mtx.asinh),"\n")
cat(" Voom :",fivenum(mtx.voom),"\n")
As-is : 0 1.753137 12.34697 43.24747 1027.766 
 Log2 : -9.965784 0.8107618 3.626201 5.434577 10.0053 
Asinh : 0 1.327453 3.208193 4.460219 7.62829 
 Voom : 6.379003 8.825687 11.33609 13.10218 17.47137 

As we expected, the as-is data have a much wider distribution but are concentrated around small values. The log2 and asinh data both scale down the distribution in different ways, with the log2 transformation incoporating negative values and the asinh transformation finding its minimum at 0.

Now that we've examined the distributions fairly thoroughly, let us move to using the solvers themselves to infer relationships between transcription factors and target genes. This will give us a better idea of how our distributions affect the solutions we infer and what we might want to be careful of in the future

Testing Distributions and Solvers in TReNA

First things first, we'll go ahead and load our function; this function will do several things:

  1. Load the "as-is" data and create the 3 different matrices we've examined above
  2. For each of those 3 matrices, create and solve a TReNA object for each of 4 solvers (LASSO, Bayes Spike, Random Forest, Square Root LASSO) with MEF2C as the target gene. This means we'll have 12 total solutions.
  3. For each of the 12 solutions, pull out the top 10 genes, then take the union of those 9 sets of 10. This will give us a list of genes representing the best hits for each combination of distribution and solver
  4. Compile the scores of those genes from the 12 solutions into a unified table (tbl.all), along with the gene names and their Pearson correlations to MEF2C.

Let's do that and produce our table.

In [3]:
# Source the desired function and generate the table
suppressMessages(source("../utils/evaluateAllSolvers.R"))
tbl.all <- assess_ampAD154AllSolversAndDistributions()
[1] --- Testing LASSO
[1] --- Testing Bayes Spike
[1] --- Testing Random Forest
[1] --- Testing Square Root LASSO

Just to get an idea of what we're looking at, here's the table. We can browse through this to see what the data look like.

In [4]:
tbl.all
genelasso.as.islasso.log2lasso.asinhlasso.voombs.as.isbs.log2bs.asinhbs.voomrf.as.isrf.log2rf.asinhrf.voomsqrt.lasso.as.issqrt.lasso.log2sqrt.lasso.asinhsqrt.lasso.voomgene.cor
9HLF 0.77458827 0.134199069 0.209422903 0.24105628 5.930056e-01 3.228768e-01 8.583764e-05 3.468772e-011266161.9489 57.37132247 25.91603826 53.05247084 0.69519133 0.140776460 0.142896092 0.15897283 0.92323940
5STAT4 2.60534321 0.157190023 0.076155038 0.09551869 1.708536e+00 1.832979e-01 8.946250e-05 1.708533e-01 649234.9844 46.37406567 22.47867569 48.94942356 3.47937884 0.190961358 0.158960329 0.17336532 0.90475987
35SATB1 0.00000000 0.000000000 0.000000000 0.00000000 -5.683068e-01-1.444623e-06 1.154825e-05-3.633166e-05 577058.4385 21.19406994 11.98698783 27.98823933 0.00000000 0.000000000 0.000000000 0.00000000 0.86282704
15SATB2 0.13675630 0.216133895 0.228754703 0.26140384 1.035786e+00 6.942847e-05 3.566778e-01 2.853092e-01 424329.6976 26.35567531 14.22518618 30.46469273 0.02139235 0.181298914 0.226526723 0.23814460 0.84232989
33ATF2 0.00000000 0.000000000 0.009811994 0.00000000 8.928837e-01 5.648769e-05 7.409788e-05-5.996884e-06 137474.9788 5.41129360 2.34014879 7.35197201 0.08809962 0.008485763 0.000000000 0.00000000 0.82786857
4FOXP2 2.61508513 0.106667287 0.120449138 0.13231032 3.972476e+00 8.526235e-05 2.309917e-01 4.508482e-05 112640.2790 3.91445249 2.97052657 3.80445761 2.44496515 0.101105760 0.106358127 0.10469561 0.81961721
11TSHZ2 0.69243601 0.104562826 0.115311122 0.09676352 1.086813e+00 2.277099e-01 1.056974e-01 1.235520e-01 84319.1960 5.12252917 2.25117903 4.60443070 1.00207756 0.100506784 0.094864545 0.08923052 0.77043615
40DRGX 0.00000000 0.000000000 0.000000000 0.00000000 -4.072208e-03 1.835087e-07 7.377633e-06 1.398560e-05 88685.5631 10.12862043 3.22875065 8.62751278 1.18904754 0.000000000 0.000000000 0.00000000 0.73094500
1HDX 11.26492428 0.000000000 0.000000000 0.00000000 3.241756e-01-2.498712e-05 6.947728e-05 5.735278e-05 53018.0228 1.43641155 0.71578023 0.86613053 13.37342127 0.000000000 0.000000000 0.00000000 0.72419056
8LHX6 1.27950800 0.066352215 0.090529696 0.09315715 2.312552e+00 6.697943e-05 1.139972e-01 1.365092e-01 33707.1603 2.90733938 1.70273480 2.28153347 2.69414714 0.058348346 0.086557473 0.09288872 0.71461154
12ESRRG 0.66926539 0.019431010 0.046640155 0.01141497 3.927342e-01 3.936057e-05 2.318925e-04 7.762502e-06 66430.6662 4.29280477 1.61084504 3.77395550 0.45898430 0.004540240 0.010904356 0.00000000 0.70903181
19TSHZ3 0.00000000 0.101236743 0.064662784 0.04451921 -1.089643e-02 8.721712e-05 1.673391e-01 2.551630e-05 23971.7861 6.25117359 3.70888147 9.28067022 0.00000000 0.103275043 0.119569865 0.09920060 0.70562897
38POU6F2 0.00000000 0.000000000 0.000000000 0.00000000 -3.743652e-02 1.548796e-01 1.126459e-05-6.461413e-06 7735.8553 1.54216776 0.61268809 1.62421935 0.00000000 0.000000000 0.000000000 0.00000000 0.68969765
14NFE2L2 -0.32653069 -0.015204118 -0.057125918 -0.04777479 -7.803000e-01-1.304394e-04-8.658551e-05-1.444340e-01 16250.1818 0.62406681 0.26788921 1.05073252 -0.01666623 -0.010311681 -0.001796439 -0.01195379 -0.63838196
17CUX2 0.03676364 0.074610279 0.134597803 0.11559182 1.240175e+00 1.096828e-01 2.066298e-01 6.543902e-05 13627.7613 3.79983083 1.69602385 3.33980250 0.00000000 0.059819887 0.070841315 0.05577775 0.62765043
18SOX12 -0.02284905 -0.065376926 -0.009410002 0.00000000 2.385374e-02-4.011697e-05-2.834563e-01-2.815619e-05 75863.7094 1.74741861 0.71539459 1.06994890 -0.05066787 -0.062815144 -0.063477372 -0.09422693 -0.62404571
10HOMEZ -0.75952764 0.000000000 0.000000000 0.00000000 -2.571221e-01-2.773147e-05-1.190609e-04-2.070808e-04 2075.6167 0.23303020 0.11305491 0.13354243 -0.05711910 0.000000000 0.000000000 0.00000000 -0.58326686
34STAT5A 0.00000000 0.000000000 0.000000000 0.00000000 -5.317519e+00 1.619035e-05 5.634046e-05 6.416900e-05 2473.5968 0.70519199 0.07501966 0.34651510 0.00000000 0.000000000 0.000000000 0.00000000 -0.57464748
21FOXO4 0.00000000 -0.059932873 -0.038120414 -0.09406223 -7.525646e-02-1.089069e-01-7.967480e-05-1.649238e-01 100624.4622 2.53966391 1.35220598 4.10752152 0.00000000 -0.060840837 -0.054144051 -0.09392238 -0.55645498
16STAT1 0.10044635 0.058977896 0.039676356 0.01249171 1.452155e-02 1.550252e-04 5.356998e-04-7.697100e-06 49426.5314 1.67592402 0.76834548 1.64999002 0.14023646 0.061726178 0.056424616 0.04170972 0.44791101
28ZHX3 0.00000000 0.000000000 -0.085656799 -0.04603107 8.553841e-04-3.468578e-01-8.150376e-05-1.474584e-04 2775.8698 0.12460750 0.05507819 0.06762929 0.00000000 0.000000000 0.000000000 0.00000000 -0.43410051
23FOXD1 0.00000000 -0.020514333 -0.043780174 -0.04197126 -1.447640e-01-8.077647e-05-1.312632e-01-1.355494e-04 3940.9648 0.17796819 0.07402906 0.17825892 0.00000000 -0.018612142 -0.030152933 -0.02430439 -0.41483052
20STAT5B 0.00000000 -0.072874603 -0.124676757 -0.24554706 -8.615242e-03-1.028715e-04-6.899009e-05-5.172639e-03 1962.9467 0.14380142 0.09304024 0.14962969 -0.03142925 -0.044682367 -0.088535616 -0.20937298 -0.37452622
6NFE2L3 -2.44526737 -0.042077199 -0.012117383 0.00000000 -3.535628e+00-6.050182e-04-1.180878e-03 8.473173e-06 1780.2860 0.90328191 0.43550208 0.85687215 -0.26084529 -0.038226731 -0.036547702 -0.02745371 -0.33931986
29FOXL1 0.00000000 0.000000000 -0.055243544 -0.02501204 -2.179699e+01-1.023130e-05-1.109603e-05-3.033104e-04 745.0412 0.04530632 0.01952745 0.06490894 0.00000000 0.000000000 0.000000000 0.00000000 -0.28785265
30RAX2 0.00000000 0.000000000 -0.054872150 -0.07062620 2.104629e+01-2.984847e-05-3.858662e-03-1.693215e-03 10191.6573 0.15838869 0.08666600 0.13254073 -3.68593268 0.000000000 -0.079070690 -0.05256948 -0.28106544
32FOXQ1 0.00000000 0.000000000 0.012481789 0.01484751 1.497996e-01 1.024722e-04 9.058601e-02 1.950611e-05 1912.3582 0.07622154 0.02708449 0.04248428 0.00000000 0.000000000 0.000000000 0.00000000 -0.26418329
7SOX15 -2.00952281 -0.047167519 -0.081356107 -0.04556199 -5.808046e+00-1.649264e-01-2.072168e-04-1.738663e-04 3073.3194 0.09674146 0.04575365 0.17657099 -0.20871913 -0.034963116 -0.033217226 0.00000000 -0.26062515
13ZFHX2 -0.44426232 -0.064360137 -0.076382960 -0.03506666 -9.485212e-01-1.449449e-01-1.972800e-01-5.449061e-05 4242.0609 0.10108451 0.05349746 0.11198257 -0.05434429 -0.044781408 -0.056811985 -0.02709282 -0.24922722
36FOXD4L6 0.00000000 0.000000000 0.000000000 0.00000000 2.001216e+01 5.598696e-05-3.493371e-05 3.952855e-05 2138.4440 0.38560704 0.08815874 0.26016141 0.00000000 0.000000000 0.000000000 0.00000000 0.24249449
39SOX6 0.00000000 0.000000000 0.000000000 0.00000000 1.117300e-02 1.410110e-01 2.635437e-04 6.368482e-05 1286.4206 0.02964059 0.02545302 0.07550225 0.00000000 0.000000000 0.000000000 0.00000000 -0.18744712
31FOXE3 0.00000000 0.000000000 -0.023970663 0.00000000 -3.409200e+01-7.539808e-07-2.472980e-04 1.070491e-05 1436.7861 0.09033598 0.02945399 0.08047401 0.00000000 0.000000000 0.000000000 0.00000000 -0.15567610
3FOXD4 -6.12834538 0.000000000 -0.018794219 -0.01121346 -9.572871e-01-2.277717e-04-2.120234e-04-1.566068e-04 7525.4338 0.13565668 0.05591150 0.14653822 -1.39091255 0.000000000 -0.005870526 0.00000000 -0.11598769
26OTX2 0.00000000 0.000000000 -0.104669744 -0.06159170 -8.116979e-02-6.058279e-06 1.893926e-05-2.229342e-01 1023.9351 0.11237214 0.05414230 0.15764033 0.00000000 0.000000000 -0.035312214 -0.01896358 -0.11114757
25FOXP3 0.00000000 0.004366561 0.061610938 0.02631893 8.865602e+00 7.248455e-04 1.555273e-02 1.446537e-03 1611.4674 0.06884909 0.02423162 0.08531106 0.00000000 0.000000000 0.000000000 0.00000000 -0.10724617
37ELF3 0.00000000 0.000000000 0.000000000 0.00000000 7.590501e+00-1.672171e-04 3.211889e-04 4.752853e-05 3326.0816 0.14770484 0.06169517 0.18053208 0.00000000 0.000000000 0.000000000 0.00000000 -0.10576117
2FOXD4L1 -10.64033243 0.000000000 -0.051396874 -0.05015705 -7.279581e+00-5.366559e-05-2.090862e-03-1.683918e-03 4436.9676 0.19584650 0.07824253 0.21936956 -2.38450203 0.000000000 -0.014643134 -0.01698541 -0.10509003
22ZBTB16 0.00000000 -0.020936732 -0.035129792 -0.07343806 -1.758344e-01-2.965382e-04-1.843154e-04-2.008173e-01 1771.1880 0.07772086 0.03064144 0.06150837 -0.02690810 -0.011124576 -0.014971105 -0.04979472 0.06296197
24ESR2 0.00000000 -0.012150312 -0.093702601 -0.05702639 5.680673e-02-1.076067e-01-1.918502e-03-1.278182e-01 1254.3418 0.06494721 0.02666922 0.10061373 0.00000000 -0.002660900 -0.008303913 0.00000000 -0.03964059
27ESRRB 0.00000000 0.000000000 -0.091401413 -0.05857560 -3.068364e+01-1.117221e-03-7.120074e-04-1.932052e-04 1860.4361 0.08006934 0.04726370 0.07232247 0.00000000 0.000000000 0.000000000 0.00000000 -0.03485913
In [5]:
dim(tbl.all)
  1. 40
  2. 18

As we can see here, the union of top 10 genes from all solutions results in 40 genes. We can see that both LASSO methods are behaving much as we'd expect them, with some coefficients being shrunken to 0. We can see a little of what's going on with the scores in general--Random Forest looks to line up with itself regardless of distribution, whereas the others do not--but rather than draw conclusions based on a cursory look, let's examine the correlations more closely.

A Broad view

We'll take a look at the correlations between all different columns to give us a broad look at our table. We'll output these correlations as a plot of pairwise correlations.

In [6]:
# Plot all pairs
par(family = "sans")
pairs(tbl.all[2:18],
     labels = names(tbl.all)[2:18])

A few things stand out here; first of all, the 4 different Random Forest distributions and the 4 different Square Root LASSO distributions look to be in very good agreement, producing the linear plots found near the bottom right corner. The other 2 solvers don't show nearly the same level of correlation between different distributions, though LASSO looks at least somewhat similar when comparing the log2 and asinh data. When it comes to the gene correlations themselves, LASSO looks to be the most similar, particularly for the asinh- transformed case.

This broad view can clue us in to some trends, but it's likely that we'll get much better resolution by looking more directly at smaller pieces of the data. Let us now zoom in to look more closely at how the different cases compare. We'll begin by looking at how different distributions affect results within the same solver, then shift to how different solvers affect results within the same distribution.

We are interested in the question: how does transforming the data affect the correlations between genes? We'll look at the top genes found by the solvers, sorted by their Pearson correlation with the target gene. First, let's look at LASSO:

In [7]:
head(tbl.all[,c(1,2,3,4,5,18)],10)
genelasso.as.islasso.log2lasso.asinhlasso.voomgene.cor
9HLF 0.7745883 0.13419907 0.2094229030.24105628 0.9232394
5STAT4 2.6053432 0.15719002 0.0761550380.09551869 0.9047599
35SATB1 0.0000000 0.00000000 0.0000000000.00000000 0.8628270
15SATB2 0.1367563 0.21613389 0.2287547030.26140384 0.8423299
33ATF2 0.0000000 0.00000000 0.0098119940.00000000 0.8278686
4FOXP2 2.6150851 0.10666729 0.1204491380.13231032 0.8196172
11TSHZ2 0.6924360 0.10456283 0.1153111220.09676352 0.7704362
40DRGX 0.0000000 0.00000000 0.0000000000.00000000 0.7309450
1HDX 11.2649243 0.00000000 0.0000000000.00000000 0.7241906
8LHX6 1.2795080 0.06635221 0.0905296960.09315715 0.7146115

As we would expect from LASSO, many genes are missing coefficients because they've been shrunken out of the model. What we wouldn't necessarily expect is that the remaining coefficients don't necessarily line up well with one another. To quantify that, let us compare the LASSO coefficients between transformation types.

In [8]:
# For each pairwise comparison, find correlation
cor(tbl.all[2:5])
lasso.as.islasso.log2lasso.asinhlasso.voom
lasso.as.is1.00000000.18914920.21357050.1865346
lasso.log20.18914921.00000000.84798430.8549927
lasso.asinh0.21357050.84798431.00000000.9336374
lasso.voom0.18653460.85499270.93363741.0000000

Correlation between the coefficients found for the "as-is" data with either transformed data type aren't particularly high, but transformed matrices are in much better agreement. We can similarly look at the results from using Bayes Spike as the solver:

In [9]:
head(tbl.all[,c(1,6:9,18)],10)
genebs.as.isbs.log2bs.asinhbs.voomgene.cor
9HLF 0.593005603 3.228768e-018.583764e-05 3.468772e-010.9232394
5STAT4 1.708535525 1.832979e-018.946250e-05 1.708533e-010.9047599
35SATB1 -0.568306798 -1.444623e-061.154825e-05 -3.633166e-050.8628270
15SATB2 1.035785852 6.942847e-053.566778e-01 2.853092e-010.8423299
33ATF2 0.892883655 5.648769e-057.409788e-05 -5.996884e-060.8278686
4FOXP2 3.972475657 8.526235e-052.309917e-01 4.508482e-050.8196172
11TSHZ2 1.086813411 2.277099e-011.056974e-01 1.235520e-010.7704362
40DRGX -0.004072208 1.835087e-077.377633e-06 1.398560e-050.7309450
1HDX 0.324175587 -2.498712e-056.947728e-05 5.735278e-050.7241906
8LHX6 2.312551744 6.697943e-051.139972e-01 1.365092e-010.7146115

From visual inspection, these coefficients are a bit all over the map in terms of magnitude and sign. Additionally, most of the coefficients for the log-transformed, and many from the asinh-transformed, are quite small. Let's look at these correlations:

In [10]:
cor(tbl.all[6:9])
bs.as.isbs.log2bs.asinhbs.voom
bs.as.is1.000000000.057407330.075524360.04127314
bs.log20.057407331.000000000.169026150.47330622
bs.asinh0.075524360.169026151.000000000.30836121
bs.voom0.041273140.473306220.308361211.00000000

There appears to be very little agreement between the as-is Bayes Spike coefficients and the other transformations, but there's some agreement in the asinh and voom transformed datasets. This is somewhat similar to what we saw for the LASSO solver, though all correlations for the Bayes Spike coefficients are weaker.

Let's continue our survey of the solvers by looking at the Random Forest scores:

In [11]:
head(tbl.all[,c(1,10:13,18)],10)
generf.as.isrf.log2rf.asinhrf.voomgene.cor
9HLF 1266161.9557.371322 25.916038353.05247080.9232394
5STAT4 649234.9846.374066 22.478675748.94942360.9047599
35SATB1 577058.4421.194070 11.986987827.98823930.8628270
15SATB2 424329.7026.355675 14.225186230.46469270.8423299
33ATF2 137474.98 5.411294 2.3401488 7.35197200.8278686
4FOXP2 112640.28 3.914452 2.9705266 3.80445760.8196172
11TSHZ2 84319.20 5.122529 2.2511790 4.60443070.7704362
40DRGX 88685.5610.128620 3.2287507 8.62751280.7309450
1HDX 53018.02 1.436412 0.7157802 0.86613050.7241906
8LHX6 33707.16 2.907339 1.7027348 2.28153350.7146115

This looks a lot more in line with what we would expect; the Random Forest scores look to decrease in size almost monotonically along with the Pearson correlations. Furthermore, they look to do so regardless of whether or not the data have been transformed, although the coefficient sizes vary quite a bit. Let's follow up with the correlations between different transformations.

In [12]:
cor(tbl.all[10:13])
rf.as.isrf.log2rf.asinhrf.voom
rf.as.is1.00000000.96505160.96048150.9529617
rf.log20.96505161.00000000.99498910.9913273
rf.asinh0.96048150.99498911.00000000.9972757
rf.voom0.95296170.99132730.99727571.0000000

The correlation coefficients between matrix types suggests very high agreement in all cases, and nearly identical trends when using log-transformed and asinh-transformed data. This is important information as we think about using Random Forest on other datasets; we can likely expect that regardless of transformation, we'll likely find similar results.

Let's finish our survey of solvers by looking at Square Root LASSO:

In [13]:
head(tbl.all[,c(1,14:17,18)],10)
genesqrt.lasso.as.issqrt.lasso.log2sqrt.lasso.asinhsqrt.lasso.voomgene.cor
9HLF 0.695191330.1407764600.14289609 0.15897283 0.9232394
5STAT4 3.479378840.1909613580.15896033 0.17336532 0.9047599
35SATB1 0.000000000.0000000000.00000000 0.00000000 0.8628270
15SATB2 0.021392350.1812989140.22652672 0.23814460 0.8423299
33ATF2 0.088099620.0084857630.00000000 0.00000000 0.8278686
4FOXP2 2.444965150.1011057600.10635813 0.10469561 0.8196172
11TSHZ2 1.002077560.1005067840.09486455 0.08923052 0.7704362
40DRGX 1.189047540.0000000000.00000000 0.00000000 0.7309450
1HDX 13.373421270.0000000000.00000000 0.00000000 0.7241906
8LHX6 2.694147140.0583483460.08655747 0.09288872 0.7146115

Much like regular LASSO, we see many coefficients shrunken to 0 and there seems to be good agreement in which coefficients are 0 in the transformed cases, particularly between the asinh and voom transformations. We'll look at the correlations to get a better look

In [14]:
cor(tbl.all[14:17])
sqrt.lasso.as.issqrt.lasso.log2sqrt.lasso.asinhsqrt.lasso.voom
sqrt.lasso.as.is1.00000000.18545670.22544510.1996011
sqrt.lasso.log20.18545671.00000000.95976720.9139551
sqrt.lasso.asinh0.22544510.95976721.00000000.9490039
sqrt.lasso.voom0.19960110.91395510.94900391.0000000

The as-is coefficients don't really line up with the others, but all 3 transformations are in close agreement here.

From looking at the within-solvers comparisons, it seems evident that we can probably expect some major differences between solver types. Let's look first at the as-is data across solvers; we'll look at the columns in a table and compute the correlation coefficients concurrently.

In [15]:
head(tbl.all[,c(1,2,6,10,14,18)],10)
cor(tbl.all[c(2,6,10,14,18)])
genelasso.as.isbs.as.isrf.as.issqrt.lasso.as.isgene.cor
9HLF 0.7745883 0.5930056031266161.95 0.69519133 0.9232394
5STAT4 2.6053432 1.708535525 649234.98 3.47937884 0.9047599
35SATB1 0.0000000 -0.568306798 577058.44 0.00000000 0.8628270
15SATB2 0.1367563 1.035785852 424329.70 0.02139235 0.8423299
33ATF2 0.0000000 0.892883655 137474.98 0.08809962 0.8278686
4FOXP2 2.6150851 3.972475657 112640.28 2.44496515 0.8196172
11TSHZ2 0.6924360 1.086813411 84319.20 1.00207756 0.7704362
40DRGX 0.0000000 -0.004072208 88685.56 1.18904754 0.7309450
1HDX 11.2649243 0.324175587 53018.02 13.37342127 0.7241906
8LHX6 1.2795080 2.312551744 33707.16 2.69414714 0.7146115
lasso.as.isbs.as.isrf.as.issqrt.lasso.as.isgene.cor
lasso.as.is1.0000000 0.117961080.14940040 0.797932090.3394498
bs.as.is0.1179611 1.000000000.07497072 -0.011880950.1611727
rf.as.is0.1494004 0.074970721.00000000 0.122369830.5062618
sqrt.lasso.as.is0.7979321 -0.011880950.12236983 1.000000000.3866010
gene.cor0.3394498 0.161172700.50626182 0.386601051.0000000

As we might expect, the 2 LASSO variations have the highest correlation, but otherwise there's not very good agreement between solvers. In terms of the gene correlations themselves, the Random Forest solver has the strongest agreement for the "as-is" data matrix, both LASSO variations are a little weaker, and the Bayes Spike shows virtually no correlation with the Pearson coefficients. Let us move on to the transformations to see what differences there are; we'll begin with log2-transformed data.

In [16]:
head(tbl.all[,c(1,3,7,11,15,18)],10)
cor(tbl.all[c(3,7,11,15,18)])
genelasso.log2bs.log2rf.log2sqrt.lasso.log2gene.cor
9HLF 0.13419907 3.228768e-0157.371322 0.140776460 0.9232394
5STAT4 0.15719002 1.832979e-0146.374066 0.190961358 0.9047599
35SATB1 0.00000000 -1.444623e-0621.194070 0.000000000 0.8628270
15SATB2 0.21613389 6.942847e-0526.355675 0.181298914 0.8423299
33ATF2 0.00000000 5.648769e-05 5.411294 0.008485763 0.8278686
4FOXP2 0.10666729 8.526235e-05 3.914452 0.101105760 0.8196172
11TSHZ2 0.10456283 2.277099e-01 5.122529 0.100506784 0.7704362
40DRGX 0.00000000 1.835087e-0710.128620 0.000000000 0.7309450
1HDX 0.00000000 -2.498712e-05 1.436412 0.000000000 0.7241906
8LHX6 0.06635221 6.697943e-05 2.907339 0.058348346 0.7146115
lasso.log2bs.log2rf.log2sqrt.lasso.log2gene.cor
lasso.log21.00000000.49125280.66011020.98526550.7078153
bs.log20.49125281.00000000.54006620.51515360.4787279
rf.log20.66011020.54006621.00000000.71660400.5547292
sqrt.lasso.log20.98526550.51515360.71660401.00000000.7057869
gene.cor0.70781530.47872790.55472920.70578691.0000000

Once again, the 2 LASSO methods are in close agreement, this time agreeing nearly identically. We now also see much stronger agreement between all methods, with correlation coefficients in the 0.5-0.66 range. In contrast to the "as-is", the log2-transformed data demonstrate at least some correlation between each solver and the gene correlation coefficients, though the 2 LASSO methods have the highest coefficients. Now we'll look at the asinh-transformed data

In [17]:
head(tbl.all[,c(1,4,8,12,16,18)],10)
cor(tbl.all[c(4,8,12,16,18)])
genelasso.asinhbs.asinhrf.asinhsqrt.lasso.asinhgene.cor
9HLF 0.209422903 8.583764e-0525.9160383 0.14289609 0.9232394
5STAT4 0.076155038 8.946250e-0522.4786757 0.15896033 0.9047599
35SATB1 0.000000000 1.154825e-0511.9869878 0.00000000 0.8628270
15SATB2 0.228754703 3.566778e-0114.2251862 0.22652672 0.8423299
33ATF2 0.009811994 7.409788e-05 2.3401488 0.00000000 0.8278686
4FOXP2 0.120449138 2.309917e-01 2.9705266 0.10635813 0.8196172
11TSHZ2 0.115311122 1.056974e-01 2.2511790 0.09486455 0.7704362
40DRGX 0.000000000 7.377633e-06 3.2287507 0.00000000 0.7309450
1HDX 0.000000000 6.947728e-05 0.7157802 0.00000000 0.7241906
8LHX6 0.090529696 1.139972e-01 1.7027348 0.08655747 0.7146115
lasso.asinhbs.asinhrf.asinhsqrt.lasso.asinhgene.cor
lasso.asinh1.00000000.60897270.61904880.85869450.6848150
bs.asinh0.60897271.00000000.22640540.70044670.5063666
rf.asinh0.61904880.22640541.00000000.68425370.5657811
sqrt.lasso.asinh0.85869450.70044670.68425371.00000000.7040126
gene.cor0.68481500.50636660.56578110.70401261.0000000

Outside of the expected high agreement between LASSO methods, this transformation also shows fairly strong agreement between Random Forest and the LASSO methods. Bayes Spike is less correlated with the others, but still much more so than in the as-is case. Compared to the gene correlation coefficients, the 2 LASSO methods still show the strongest agreement and Bayes Spike has fallen off a bit. Let's finish by looking at the VOOM-transformed data.

In [18]:
head(tbl.all[,c(1,5,9,13,17,18)],10)
cor(tbl.all[c(5,9,13,17,18)])
genelasso.voombs.voomrf.voomsqrt.lasso.voomgene.cor
9HLF 0.24105628 3.468772e-0153.0524708 0.15897283 0.9232394
5STAT4 0.09551869 1.708533e-0148.9494236 0.17336532 0.9047599
35SATB1 0.00000000 -3.633166e-0527.9882393 0.00000000 0.8628270
15SATB2 0.26140384 2.853092e-0130.4646927 0.23814460 0.8423299
33ATF2 0.00000000 -5.996884e-06 7.3519720 0.00000000 0.8278686
4FOXP2 0.13231032 4.508482e-05 3.8044576 0.10469561 0.8196172
11TSHZ2 0.09676352 1.235520e-01 4.6044307 0.08923052 0.7704362
40DRGX 0.00000000 1.398560e-05 8.6275128 0.00000000 0.7309450
1HDX 0.00000000 5.735278e-05 0.8661305 0.00000000 0.7241906
8LHX6 0.09315715 1.365092e-01 2.2815335 0.09288872 0.7146115
lasso.voombs.voomrf.voomsqrt.lasso.voomgene.cor
lasso.voom1.00000000.74145610.61955940.91214950.6367442
bs.voom0.74145611.00000000.68384550.68332420.5055787
rf.voom0.61955940.68384551.00000000.64444180.5682585
sqrt.lasso.voom0.91214950.68332420.64444181.00000000.6647592
gene.cor0.63674420.50557870.56825850.66475921.0000000

In this case, we see the strongest overall agreement between solvers. This is mainly driven by heightened correlation between Bayes Spike and the other methods. The two LASSO methods still have the highest agreement with the Pearson correlations between target gene and transcription factors, but they are not as strong as in the other transformations.

At this point, we've delved into a whole bunch of information and pulled out lots of correlations between the different transformations, different solvers, and the gene correlations coefficients. So let's step back and sum up what we've seen.

Takeaways

Based upon this short data exploration, there are a few key points to take away:

  1. Data transformation can profoundly affect results. There is no guarantee of consistency across different transformations, even for the same solvers. Furthermore, different transformations can affect which solvers agree with one another.

  2. Random Forest scores are the most consistent across different transformations, so the distribution of the data may be less of a concern if using this solver. LASSO and Square Root LASSO also show strong agreement between the transformed cases, but not compared to the as-is data.

  3. Regardless of transformation, LASSO coefficients and Random Forest scores tend to have stronger correlation with gene correlations than do Bayes Spike coefficients.

Though it is tempting to recommend a particular solver and/or transformation to use in every case, this is not a realistic outcome of the exploration done here. Rather, these data should serve as an example of the benefits and consequences involved with these choices, as we have demonstrated that choice of solver and transformation have a sizable impact on the final results.

Perhaps the best summation of this foray into different transformations is found in the VOOM paper itself:

"...correct modeling of the mean-variance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis."

Thus, transformation techniques such as VOOM that take into account the mean-variance relationship represent a standard way for preparing robust data for use in tools such as TReNA.

In [ ]: