Check out more notebooks at our 'Regulome Explorer Repository'!
In this notebook we describe the implementation of a student's t test to compute the significance of associations between a numerical feature (Gene expression, Somatic copy number, etc.) and Somatic mutation data which can be categorized into two groups according to the presence or absence of somatic mutations in a user defined gene. Detail of the test can be found the following link: https://en.wikipedia.org/wiki/Welch%27s_t-test
To describe the implementation of the test using BigQuery, we will use Gene expresion data of a user defined gene and the precense or absence of somatic mutation of a user defined gene. This data is read from a BigQuery table in the pancancer-atlas dataset.
The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.
from google.cloud import bigquery
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import mstats
import seaborn as sns
import re_module.bq_functions as regulome
The parameters for this experiment are the cancer type, the name of gene for which gene expression data will be obtained, and the clinical feature name. Categorical groups with number of samples smaller than 'MinSampleSize' will be ignored in the test.
cancer_type = 'LGG'
gene_expre = 'DRG2'
gene_mutation = 'TP53'
MinSampleSize = 10
bqclient = bigquery.Client()
The first step is to select all participants in the selected study with a least one mutation.
barcode_set = """barcodes AS (
SELECT Tumor_SampleBarcode AS SampleBarcode
FROM `pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample`
WHERE Study = '{0}'
)
""".format(cancer_type)
Gene expression data from the BigQuery: The following query string retrieves the gene expression data of the user specified gene ('gene_expre') from the 'Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered' table available in pancancer-atlas dataset. The gene expression of a participant is computed as the average gene expression of the tumor samples of the participant. Moreover, we are considering only samples with a least somatic mutation.
query_table1 = """table1 AS (
SELECT Symbol, data, ParticipantBarcode
FROM (
SELECT
Symbol AS symbol, AVG( LOG10( normalized_count + 1 )) AS data, ParticipantBarcode
FROM `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered`
WHERE Study = '{0}' AND Symbol ='{1}' AND normalized_count IS NOT NULL
AND SampleBarcode IN (SELECT * FROM barcodes)
GROUP BY
ParticipantBarcode, symbol
)
)
""".format(cancer_type, gene_expre )
Somatic mutation data from the BigQuery: The following string query will retrieve a table with patients with at least one Somatic mutation in the user defined gene ('mutation_name'). This information is extracted from the 'pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample' table, available in pancancer-atlas dataset. Notice that we only use samples in which FILTER = 'PASS'.
query_table2 = """table2 AS (
SELECT
ParticipantBarcode
FROM
(
SELECT
ParticipantBarcode AS ParticipantBarcode
FROM `pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample`
WHERE Study = '{0}' AND Hugo_Symbol = '{1}'
AND FILTER = 'PASS'
GROUP BY ParticipantBarcode
)
)
""".format(cancer_type, gene_mutation )
At this point we can take a look at the combined data (Gene expression and Somatic mutation) by using a simple LEFT JOIN command. Participants with and without somatic mutations are labeled as 'YES' and 'NO' respectively.
sql_data = 'WITH\n'+ barcode_set+','+ query_table1+','+query_table2
sql = (sql_data +
"""
SELECT
n1.data as data1,
IF( n2.ParticipantBarcode is null, 'NO', 'YES') as data2,
n1.ParticipantBarcode
FROM
table1 n1
LEFT JOIN table2 n2
ON n1.ParticipantBarcode = n2.ParticipantBarcode
""")
df_data = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
df_data[1:10]
in runQuery ... the results for this query were previously cached
data1 | data2 | ParticipantBarcode | |
---|---|---|---|
1 | 2.986545 | NO | TCGA-VW-A7QS |
2 | 2.398801 | YES | TCGA-DB-5280 |
3 | 2.513401 | YES | TCGA-HT-7611 |
4 | 3.063967 | NO | TCGA-DH-5141 |
5 | 2.549122 | NO | TCGA-HT-A616 |
6 | 2.769918 | NO | TCGA-HT-7694 |
7 | 2.527377 | YES | TCGA-CS-6290 |
8 | 2.832638 | NO | TCGA-HW-7495 |
9 | 2.373629 | YES | TCGA-WY-A85C |
To visualize the gene expression data in both groups with or without somatic mutation, we can use a 'violinplot' plot:
df_data.rename(columns={ "data1": gene_expre, "data2": gene_mutation }, inplace=True)
sns.violinplot( x=df_data[gene_mutation], y=df_data[gene_expre], palette="Pastel1")
<matplotlib.axes._subplots.AxesSubplot at 0x1a256eb2e8>
The T-score (T), assuming the distributions of gene expression with and without mutation have unequal variances, is computed by using the following equation:
$$T = \frac{ \bar{g}_y - \bar{g}_n }{\sqrt{ \frac{s_y^2}{N_y} + \frac{s_n^2}{N_n} } }$$where
Since the Somatic mutation table contains information of positive somatic mutation only, the averages and variances needed to compute the $T$ score are computed as a function $S_y=\sum_{i=1}^{N_y}{g_i}$ and $Q_y=\sum_{i=1}^{N_y}{g_i^2}$, the summs over the gene expression and squared gene expression for articipants with somatic mutation. The following query string computes $S_y$ and $Q_y$:
summ_table = """
summ_table AS (
SELECT
Symbol,
COUNT( n1.ParticipantBarcode) as Ny,
SUM( n1.data ) as Sy,
SUM( n1.data * n1.data ) as Qy
FROM
table1 AS n1
INNER JOIN
table2 AS n2
ON
n1.ParticipantBarcode = n2.ParticipantBarcode
GROUP BY Symbol
)
"""
After computing $S_y$ and $Q_y$ we can compute the mean and the variance as : $$\bar{g}_y =\frac{S_y}{N_y}$$ $$s_y^2 = (N_y-1)^{-1} \left[ Q_y - \frac{S_y^2}{ N_y} \right]$$
To compute $S_n$ and $Q_n$, we first compute the sums of the gene expression and squeared gene expression using all the samples and then substract $S_y$ and $Q_y$. The following query uses this approach to compute the necessary variances and means, and then computes $T$ score. The $T$ score is only computed if the variances are greater than zero and if the number of participants in each group is greater than a user defined threshold.
query_tscore = """
SELECT
Ny, Nn,
avg_y, avg_n,
ABS(avg_y - avg_n)/ SQRT( var_y /Ny + var_n/Nn ) as tscore
FROM (
SELECT Ny,
Sy / Ny as avg_y,
( Qy - Sy*Sy/Ny )/(Ny - 1) as var_y,
Nt - Ny as Nn,
(St - Sy)/(Nt - Ny) as avg_n,
(Qt - Qy - (St-Sy)*(St-Sy)/(Nt - Ny) )/(Nt - Ny -1 ) as var_n
FROM summ_table as n1
LEFT JOIN ( SELECT Symbol, COUNT( ParticipantBarcode ) as Nt, SUM( data ) as St, SUM( data*data ) as Qt
FROM table1 GROUP BY Symbol
) as n2
ON n1.Symbol = n2.Symbol
)
WHERE
Ny > {0} AND Nn > {0} AND var_y > 0 and var_n > 0
""".format(str(MinSampleSize))
sql = ( sql_data + ',\n' + summ_table + query_tscore )
df_tscore = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
df_tscore
in runQuery ... the results for this query were previously cached
Ny | Nn | avg_y | avg_n | tscore | |
---|---|---|---|---|---|
0 | 248 | 261 | 2.495134 | 2.894686 | 24.426213 |
To test our implementation we can use the 'ttest_ins' function available in python:
print( df_data.groupby(gene_mutation).agg(['mean', 'count']) )
Set1 = df_data[df_data[gene_mutation]=='NO']
Set2 = df_data[df_data[gene_mutation]=='YES']
print( stats.ttest_ind(Set1[gene_expre], Set2[gene_expre], equal_var=False ) )
DRG2 mean count TP53 NO 2.894686 261 YES 2.495134 248 Ttest_indResult(statistic=24.426213134587776, pvalue=4.38554327085175e-84)