# Regulome Explorer T test for numerical and binary data¶

Check out more notebooks at our 'Regulome Explorer Repository'!

In this notebook we describe the implementation of a student's t test to compute the significance of associations between a numerical feature (Gene expression, Somatic copy number, etc.) and Somatic mutation data which can be categorized into two groups according to the presence or absence of somatic mutations in a user defined gene. Detail of the test can be found the following link: https://en.wikipedia.org/wiki/Welch%27s_t-test

To describe the implementation of the test using BigQuery, we will use Gene expresion data of a user defined gene and the precense or absence of somatic mutation of a user defined gene. This data is read from a BigQuery table in the pancancer-atlas dataset.

The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

#### Import Python libraries¶

In [1]:
from google.cloud import bigquery
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import mstats
import seaborn as sns
import re_module.bq_functions as regulome


## User defined Parameters¶

The parameters for this experiment are the cancer type, the name of gene for which gene expression data will be obtained, and the clinical feature name. Categorical groups with number of samples smaller than 'MinSampleSize' will be ignored in the test.

In [2]:
cancer_type = 'LGG'
gene_expre = 'DRG2'
gene_mutation = 'TP53'
MinSampleSize = 10

bqclient = bigquery.Client()


## Data from BigQuery tables¶

The first step is to select all participants in the selected study with a least one mutation.

In [3]:
barcode_set = """barcodes AS (
SELECT Tumor_SampleBarcode AS SampleBarcode
FROM pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample
WHERE Study = '{0}'
)
""".format(cancer_type)


Gene expression data from the BigQuery: The following query string retrieves the gene expression data of the user specified gene ('gene_expre') from the 'Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered' table available in pancancer-atlas dataset. The gene expression of a participant is computed as the average gene expression of the tumor samples of the participant. Moreover, we are considering only samples with a least somatic mutation.

In [4]:
query_table1 = """table1 AS (
SELECT Symbol, data, ParticipantBarcode
FROM (
SELECT
Symbol AS symbol, AVG( LOG10( normalized_count + 1 )) AS data, ParticipantBarcode
FROM  pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered
WHERE Study = '{0}' AND Symbol ='{1}' AND normalized_count IS NOT NULL
AND SampleBarcode  IN (SELECT * FROM barcodes)

GROUP BY
ParticipantBarcode, symbol
)
)
""".format(cancer_type, gene_expre )


Somatic mutation data from the BigQuery: The following string query will retrieve a table with patients with at least one Somatic mutation in the user defined gene ('mutation_name'). This information is extracted from the 'pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample' table, available in pancancer-atlas dataset. Notice that we only use samples in which FILTER = 'PASS'.

In [5]:
query_table2 = """table2 AS (
SELECT
ParticipantBarcode
FROM
(
SELECT
ParticipantBarcode AS ParticipantBarcode
FROM pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample
WHERE Study = '{0}' AND Hugo_Symbol = '{1}'
AND FILTER = 'PASS'
GROUP BY ParticipantBarcode
)
)
""".format(cancer_type, gene_mutation )


At this point we can take a look at the combined data (Gene expression and Somatic mutation) by using a simple LEFT JOIN command. Participants with and without somatic mutations are labeled as 'YES' and 'NO' respectively.

In [6]:
sql_data = 'WITH\n'+ barcode_set+','+ query_table1+','+query_table2

sql = (sql_data +
"""
SELECT
n1.data as data1,
IF( n2.ParticipantBarcode is null, 'NO', 'YES') as data2,
n1.ParticipantBarcode
FROM
table1 n1
LEFT JOIN  table2   n2
ON  n1.ParticipantBarcode = n2.ParticipantBarcode
""")

df_data = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
df_data[1:10]

 in runQuery ...
the results for this query were previously cached

Out[6]:
data1 data2 ParticipantBarcode
1 2.986545 NO TCGA-VW-A7QS
2 2.398801 YES TCGA-DB-5280
3 2.513401 YES TCGA-HT-7611
4 3.063967 NO TCGA-DH-5141
5 2.549122 NO TCGA-HT-A616
6 2.769918 NO TCGA-HT-7694
7 2.527377 YES TCGA-CS-6290
8 2.832638 NO TCGA-HW-7495
9 2.373629 YES TCGA-WY-A85C

To visualize the gene expression data in both groups with or without somatic mutation, we can use a 'violinplot' plot:

In [8]:
df_data.rename(columns={ "data1": gene_expre, "data2": gene_mutation }, inplace=True)
sns.violinplot( x=df_data[gene_mutation], y=df_data[gene_expre], palette="Pastel1")

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a256eb2e8>

## BigQuery to Compute statistical association¶

The T-score (T), assuming the distributions of gene expression with and without mutation have unequal variances, is computed by using the following equation:

$$T = \frac{ \bar{g}_y - \bar{g}_n }{\sqrt{ \frac{s_y^2}{N_y} + \frac{s_n^2}{N_n} } }$$

where

• $\bar{g}_y$ and $\bar{g}_n$ are mean gene expression of participants with and without mutation.
• $N_y$ and $N_n$ are the number of participants in the group with and without mutation.
• $s_y^2$ and $s_n^2$ are the variance of gene expression for the participants with and without mutation, respectively.

Since the Somatic mutation table contains information of positive somatic mutation only, the averages and variances needed to compute the $T$ score are computed as a function $S_y=\sum_{i=1}^{N_y}{g_i}$ and $Q_y=\sum_{i=1}^{N_y}{g_i^2}$, the summs over the gene expression and squared gene expression for articipants with somatic mutation. The following query string computes $S_y$ and $Q_y$:

In [9]:
summ_table = """
summ_table AS (
SELECT
Symbol,
COUNT( n1.ParticipantBarcode) as Ny,
SUM( n1.data )  as Sy,
SUM( n1.data * n1.data ) as Qy

FROM
table1 AS n1
INNER JOIN
table2 AS n2
ON
n1.ParticipantBarcode = n2.ParticipantBarcode
GROUP BY Symbol
)
"""


After computing $S_y$ and $Q_y$ we can compute the mean and the variance as : $$\bar{g}_y =\frac{S_y}{N_y}$$ $$s_y^2 = (N_y-1)^{-1} \left[ Q_y - \frac{S_y^2}{ N_y} \right]$$

To compute $S_n$ and $Q_n$, we first compute the sums of the gene expression and squeared gene expression using all the samples and then substract $S_y$ and $Q_y$. The following query uses this approach to compute the necessary variances and means, and then computes $T$ score. The $T$ score is only computed if the variances are greater than zero and if the number of participants in each group is greater than a user defined threshold.

In [10]:
query_tscore = """
SELECT
Ny, Nn,
avg_y, avg_n,
ABS(avg_y - avg_n)/ SQRT( var_y /Ny + var_n/Nn )  as tscore
FROM (
SELECT Ny,
Sy / Ny as avg_y,
( Qy - Sy*Sy/Ny )/(Ny - 1) as var_y,
Nt - Ny as Nn,
(St - Sy)/(Nt - Ny) as avg_n,
(Qt - Qy - (St-Sy)*(St-Sy)/(Nt - Ny) )/(Nt - Ny -1 ) as var_n
FROM  summ_table as n1
LEFT JOIN ( SELECT Symbol, COUNT( ParticipantBarcode ) as Nt, SUM( data ) as St, SUM( data*data ) as Qt
FROM table1 GROUP BY Symbol
) as n2
ON n1.Symbol = n2.Symbol
)
WHERE
Ny > {0} AND Nn > {0} AND var_y > 0 and var_n > 0
""".format(str(MinSampleSize))

sql = ( sql_data  + ',\n' + summ_table + query_tscore )
df_tscore = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
df_tscore

 in runQuery ...
the results for this query were previously cached

Out[10]:
Ny Nn avg_y avg_n tscore
0 248 261 2.495134 2.894686 24.426213

To test our implementation we can use the 'ttest_ins' function available in python:

In [12]:
print( df_data.groupby(gene_mutation).agg(['mean', 'count']) )

Set1 = df_data[df_data[gene_mutation]=='NO']
Set2 = df_data[df_data[gene_mutation]=='YES']

print( stats.ttest_ind(Set1[gene_expre], Set2[gene_expre], equal_var=False ) )

          DRG2
mean count
TP53
NO    2.894686   261
YES   2.495134   248
Ttest_indResult(statistic=24.426213134587776, pvalue=4.38554327085175e-84)

In [ ]: