When employing TReNA, we apply a solver to an assay matrix to infer relationships between transcription factors and target genes. Both the solver and the assay matrix can be varied: there are currently 6 different solvers built into TReNA, and the assay matrix can be transformed in various ways. In this notebook, we will examine these two variables in an attempt to answer the following questions:

- How do the different solvers compare to one another on the same data?
- What is the effect of transforming the assay matrix in different ways?

To this end, we will be using the ampAD data of 154 transcription factors and 278 samples, with MEF2C as the target gene. Our four solvers will be LASSO, Random Forest, Bayes Spike, and Square Root LASSO. The 4 different data transformations we will apply here are: as-is (unchanged), log2-transformed, asinh-transformed, and voom-transformed.

For this exercise, we will be using RNAseq data gathered for Alzheimer's disease (AD) and related diseases (original paper here). These expression data are from samples taken during brain autopsies from 4 groups of subjects:

- Normal/control group, with no discernible cognitive impairment and no diagnosis of any neurodegenerative disease
- AD group, with cognitive impairment and a diagnosis of AD
- Progressive supranuclear palsy (PSP) group, with cognitive impairment and a diagnosis of PSP
- Pathological aging group, with the presence of amyloid plaques associated with a diseased state, but with no discernible cognitive impairment.

MEF2C is a gene identified through genome-wide association studies to be associated with protection from AD. To better understand which transcription factors are associated with MEF2C, and thus may be associated with AD, we will use TReNA to identify meaningful relationships.

Let us first get a handle on what our data look like under the 4 different transformations.

The "as-is" data are fairly straightforward; we have RNAseq data in units of RPKM with expression levels concentrated near 0, but the range of the data is quite large. This skewed dataset may not be ideal (though we don't exactly have a standard to compare to), so one way to adjust that skewed nature is to do a log2 transformation.

For each value in the matrix, the log2 transformation subtracts the minimum expression value in the matrix, adds a very small number (0.001) to prevent any attempt at taking log2(0), and then takes log2() of the value. As a result, the log2-transformed data have a much smaller range and appear much more Gaussian than the as-is data. They also cross over 0 into the negative range, a characteristic that differs from the original data.
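As a quick illustration on a toy vector (not the real assay matrix; the values here are made up for demonstration), the shift-and-offset step guarantees every value passed to log2() is strictly positive:

```r
# Toy demonstration of the shift + offset applied before log2()
x <- c(0, 0.5, 3, 120, 4500)       # skewed, RPKM-like values including a zero
x.shifted <- x - min(x) + 0.001    # the minimum value becomes exactly 0.001
x.log2 <- log2(x.shifted)
range(x)       # wide, heavily skewed range
range(x.log2)  # compressed range; the minimum is log2(0.001), about -9.97
```

Without the 0.001 offset, the minimum value would map to log2(0), which is -Inf.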

The third data transformation is using asinh, the hyperbolic arcsine function. The actual math here is to simply take asinh() of every value in the matrix; the resulting distribution is scaled down, much like the log2-transformed data, but all values are positive.
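A small check on toy values (illustrative only) shows why asinh needs no offset: it is defined at 0, and for large values it compresses much like a log transform:

```r
# asinh is defined everywhere, so no minimum shift or offset is needed
asinh(0)                       # exactly 0: zero expression maps to zero
asinh(c(0.5, 3, 120, 4500))    # compressed, non-negative for non-negative input
# For large x, asinh(x) is approximately log(2 * x)
asinh(4500) - log(2 * 4500)    # essentially 0
```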

The fourth and final transformation is the VOOM transformation, a technique that estimates the mean-variance relationship in the data and uses it to weight each observation. The transformation is performed using the *voom()* function in the *limma* package from Bioconductor, which is documented here. The manuscript for the VOOM transformation can be found here.

For a better understanding of what these distributions look like, we can look at histograms of all 4:

In [1]:

```
# Load the assay matrix (mtx.sub) and transform it 3 ways
load("../extdata/ampAD.154genes.mef2cTFs.278samples.RData")
mtx.tmp <- mtx.sub - min(mtx.sub) + 0.001
mtx.log2 <- log2(mtx.tmp)
mtx.asinh <- asinh(mtx.sub)
suppressMessages(library(limma))
mtx.voom <- voom(mtx.sub)$E
# Plot them all
par(mfrow = c(4,1))
par(family = "sans")
hist(mtx.sub, main = "As-is Data")
hist(mtx.log2, main = "Log2-Transformed Data")
hist(mtx.asinh, main = "Asinh-Transformed Data")
hist(mtx.voom, main = "Voom-Transformed Data")
```

The data distributions are quite dissimilar, so it follows that we might expect different results when using them, even if we're using the same solver. To explicitly quantify these distributions, we can also take Tukey's five-number summary (min, 1st quartile, median, 3rd quartile, max) of each one:

In [2]:

```
cat("As-is :",fivenum(mtx.sub),"\n")
cat(" Log2 :",fivenum(mtx.log2),"\n")
cat("Asinh :",fivenum(mtx.asinh),"\n")
cat(" Voom :",fivenum(mtx.voom),"\n")
```

As we expected, the as-is data have a much wider distribution but are concentrated around small values. The log2 and asinh transformations both scale down the distribution in different ways, with the log2 transformation incorporating negative values and the asinh transformation finding its minimum at 0.

Now that we've examined the distributions fairly thoroughly, let us move on to using the solvers themselves to infer relationships between transcription factors and target genes. This will give us a better idea of how our distributions affect the solutions we infer and what we might want to be careful of in the future.

First things first, we'll go ahead and load our function; this function will do several things:

- Load the "as-is" data and create the 3 transformed matrices we've examined above, for 4 matrices in total
- For each of those 4 matrices, create and solve a TReNA object for each of 4 solvers (LASSO, Bayes Spike, Random Forest, Square Root LASSO) with MEF2C as the target gene. This means we'll have 16 total solutions.
- For each of the 16 solutions, pull out the top 10 genes, then take the union of those 16 sets of 10. This will give us a list of genes representing the best hits for each combination of distribution and solver.
- Compile the scores of those genes from the 16 solutions into a unified table (tbl.all), along with the gene names and their Pearson correlations to MEF2C.
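For reference, a single solver/matrix combination inside that function looks roughly like the sketch below. This is a hedged sketch assuming the TReNA package's constructor-and-solve interface; the solver name string and the choice of all other genes as candidate transcription factors are illustrative assumptions, not an excerpt of evaluateAllSolvers.R.

```r
# Hedged sketch of one solver run (assumes the TReNA package API; not the
# actual body of assess_ampAD154AllSolversAndDistributions)
library(TReNA)
target.gene <- "MEF2C"
tfs <- setdiff(rownames(mtx.log2), target.gene)   # candidate TFs: all other genes
trena <- TReNA(mtx.assay = mtx.log2, solver = "lasso")
tbl.lasso <- solve(trena, target.gene, tfs)       # data frame of per-TF scores
```

The full function repeats this pattern for each solver and each transformed matrix, then merges the results into tbl.all.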

Let's do that and produce our table.

In [3]:

```
# Source the desired function and generate the table
suppressMessages(source("../utils/evaluateAllSolvers.R"))
tbl.all <- assess_ampAD154AllSolversAndDistributions()
```

Just to get an idea of what we're looking at, here's the table. We can browse through this to see what the data look like.

In [4]:

```
tbl.all
```

In [5]:

```
dim(tbl.all)
```

As we can see here, the union of top-10 genes from all solutions results in 40 genes. Both LASSO methods behave much as we'd expect, with some coefficients being shrunken to 0. We can also see a little of what's going on with the scores in general (Random Forest agrees with itself regardless of distribution, whereas the others do not), but rather than draw conclusions from a cursory look, let's examine the correlations more closely.

We'll take a look at the correlations between all different columns to give us a broad look at our table. We'll output these correlations as a plot of pairwise correlations.

In [6]:

```
# Plot all pairs
par(family = "sans")
pairs(tbl.all[2:18],
      labels = names(tbl.all)[2:18])
```

A few things stand out here. First, the Random Forest results and the Square Root LASSO results across the 4 distributions look to be in very good agreement, producing the linear plots found near the bottom right corner. The other 2 solvers don't show nearly the same level of correlation between distributions, though LASSO looks at least somewhat similar when comparing the log2 and asinh data. When it comes to the gene correlations themselves, LASSO looks to be the most similar, particularly in the asinh-transformed case.

This broad view can clue us in to some trends, but it's likely that we'll get much better resolution by looking more directly at smaller pieces of the data. Let us now zoom in to look more closely at how the different cases compare. We'll begin by looking at how different distributions affect results within the same solver, then shift to how different solvers affect results within the same distribution.

We are interested in the question: how does transforming the data affect the correlations between genes? We'll look at the top genes found by the solvers, sorted by their Pearson correlation with the target gene. First, let's look at LASSO:

In [7]:

```
head(tbl.all[,c(1,2,3,4,5,18)],10)
```

As we would expect from LASSO, many genes are missing coefficients because they've been shrunken out of the model. What we wouldn't necessarily expect is that the remaining coefficients don't line up well with one another. To quantify that, let us compare the LASSO coefficients between transformation types.

In [8]:

```
# For each pairwise comparison, find correlation
cor(tbl.all[2:5])
```

Correlations between the coefficients found for the "as-is" data and any of the transformed data types aren't particularly high, but the transformed matrices are in much better agreement with one another. We can similarly look at the results from using Bayes Spike as the solver:

In [9]:

```
head(tbl.all[,c(1,6:9,18)],10)
```

From visual inspection, these coefficients are a bit all over the map in terms of magnitude and sign. Additionally, most of the coefficients for the log2-transformed data, and many for the asinh-transformed data, are quite small. Let's look at these correlations:

In [10]:

```
cor(tbl.all[6:9])
```

There appears to be very little agreement between the as-is Bayes Spike coefficients and the other transformations, but there's some agreement between the asinh- and voom-transformed datasets. This is somewhat similar to what we saw for the LASSO solver, though all correlations for the Bayes Spike coefficients are weaker.

Let's continue our survey of the solvers by looking at the Random Forest scores:

In [11]:

```
head(tbl.all[,c(1,10:13,18)],10)
```

This looks a lot more in line with what we would expect; the Random Forest scores decrease almost monotonically along with the Pearson correlations. Furthermore, they do so regardless of whether the data have been transformed, although the score magnitudes vary quite a bit. Let's follow up with the correlations between different transformations.

In [12]:

```
cor(tbl.all[10:13])
```

The correlation coefficients between matrix types suggest very high agreement in all cases, and nearly identical trends when using log2-transformed and asinh-transformed data. This is important information as we think about using Random Forest on other datasets; regardless of transformation, we'll likely find similar results.

Let's finish our survey of solvers by looking at Square Root LASSO:

In [13]:

```
head(tbl.all[,c(1,14:17,18)],10)
```

Much like regular LASSO, we see many coefficients shrunken to 0, and there seems to be good agreement on which coefficients are 0 in the transformed cases, particularly between the asinh and voom transformations. We'll compute the correlations to get a closer look.

In [14]:

```
cor(tbl.all[14:17])
```

The as-is coefficients don't really line up with the others, but all 3 transformations are in close agreement here.

From looking at the within-solver comparisons, it seems evident that we can expect some major differences between solver types. Let's look first at the as-is data across solvers; we'll display the relevant columns of the table and compute the correlation coefficients concurrently.

In [15]:

```
head(tbl.all[,c(1,2,6,10,14,18)],10)
cor(tbl.all[c(2,6,10,14,18)])
```

As we might expect, the 2 LASSO variations have the highest correlation, but otherwise there's not very good agreement between solvers. In terms of the gene correlations themselves, the Random Forest solver has the strongest agreement for the "as-is" data matrix, both LASSO variations are a little weaker, and the Bayes Spike shows virtually no correlation with the Pearson coefficients. Let us move on to the transformations to see what differences there are; we'll begin with log2-transformed data.

In [16]:

```
head(tbl.all[,c(1,3,7,11,15,18)],10)
cor(tbl.all[c(3,7,11,15,18)])
```

Once again, the 2 LASSO methods are in close agreement, this time nearly identically. We now also see much stronger agreement between all methods, with correlation coefficients in the 0.5-0.66 range. In contrast to the as-is data, the log2-transformed data demonstrate at least some correlation between each solver and the gene correlation coefficients, though the 2 LASSO methods have the highest coefficients. Now we'll look at the asinh-transformed data.

In [17]:

```
head(tbl.all[,c(1,4,8,12,16,18)],10)
cor(tbl.all[c(4,8,12,16,18)])
```

Outside of the expected high agreement between LASSO methods, this transformation also shows fairly strong agreement between Random Forest and the LASSO methods. Bayes Spike is less correlated with the others, but still much more so than in the as-is case. Compared to the gene correlation coefficients, the 2 LASSO methods still show the strongest agreement and Bayes Spike has fallen off a bit. Let's finish by looking at the VOOM-transformed data.

In [18]:

```
head(tbl.all[,c(1,5,9,13,17,18)],10)
cor(tbl.all[c(5,9,13,17,18)])
```

In this case, we see the strongest overall agreement between solvers. This is mainly driven by heightened correlation between Bayes Spike and the other methods. The two LASSO methods still have the highest agreement with the Pearson correlations between target gene and transcription factors, but they are not as strong as in the other transformations.

At this point, we've delved into a whole bunch of information and pulled out lots of correlations between the different transformations, different solvers, and the gene correlation coefficients. So let's step back and sum up what we've seen.

Based upon this short data exploration, there are a few key points to take away:

- Data transformation can profoundly affect results. There is no guarantee of consistency across different transformations, even for the same solver. Furthermore, different transformations can affect which solvers agree with one another.
- Random Forest scores are the most consistent across different transformations, so the distribution of the data may be less of a concern when using this solver. LASSO and Square Root LASSO also show strong agreement between the transformed cases, but not with the as-is data.
- Regardless of transformation, LASSO coefficients and Random Forest scores tend to correlate more strongly with the gene correlations than do Bayes Spike coefficients.

Though it is tempting to recommend a particular solver and/or transformation to use in every case, that is not a realistic outcome of the exploration done here. Rather, these data should serve as an example of the benefits and consequences involved in these choices, as we have demonstrated that the choice of solver and transformation has a sizable impact on the final results.

Perhaps the best summation of this foray into different transformations is found in the VOOM paper itself:

"...correct modeling of the mean-variance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis."

Thus, transformation techniques such as VOOM that take the mean-variance relationship into account represent a principled way of preparing data for use in tools such as TReNA.
