Check out more notebooks at our 'Regulome Explorer Repository'!
Title: All pairwise associations between Gene Expression and MicroRNA data in TCGA
Author: Boris Aguilar
Created: 03-11-2020
Purpose: Demonstrate how to use the Regulome explorer functionality to find significant associations from Gene expression and microRNA BigQuery tables
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Correlation-GeneExpression-MicroRNA
In this notebook, we use BigQuery to find significant correlations between all possible pairs of genes in the RNAseq data and unique identifiers in the microRNA data available in TCGA. The cohort for this analysis consists of BRCA patients that are 50 years old or younger at the time of diagnosis and Stage IIA as pathological state.
We used Bigquery to compute Pearson correlations and identify significant correlation coefficients using a BigQuery user-defined function. The necessary queries are generated using Python functions implemented in the Regulome Explorer module.
The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.
%matplotlib inline
from google.cloud import bigquery
from scipy import stats
import re_module.bq_functions as regulome
import numpy as np
In this experiment we will use Gene Expression data and microRNA data available in ISB-CGC. The information about the Bigquery tables and their columns are stored in the Features dictionary. The significance level and the minimum number of patients for the correlation analysis are free parameters that can be changed.
If you want to test this notebook checking all possible pairs please replace "labels1 = ..." with "labels1 = ' IS NOT NULL '" but know the computations will take several minutes to complete; we recommend to generate the query and copy it in the console.
# information from Bigquery tables
Features = { 'Gene Expression' : { 'table' : 'isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression',
'symbol' : 'gene_name',
'study' : 'project_short_name',
'data' : 'AVG( LOG10( HTSeq__Counts + 1 ) ) ',
'rnkdata': 'data',
'avgdat' : 'avgdata',
'patientcode': 'case_barcode',
'samplecode': 'sample_barcode',
'where' : 'AND HTSeq__Counts IS NOT NULL',
'dattype': 'numeric' },
'MicroRNA Expression': { 'table' : 'isb-cgc.TCGA_hg38_data_v0.miRNAseq_Expression',
'symbol' : 'mirna_id',
'study' : 'project_short_name',
'data' : 'AVG( reads_per_million_miRNA_mapped )',
'rnkdata': 'data',
'avgdat' : 'avgdata',
'patientcode': 'case_barcode',
'samplecode': 'sample_barcode',
'where' : 'AND reads_per_million_miRNA_mapped IS NOT NULL',
'dattype': 'numeric'}
}
# parameteres for Gene expression data
feat1 = Features['Gene Expression']
cohort1 = feat1['patientcode'] + " IN ( SELECT case_barcode FROM cohort ) "
labels1 = ' = \'RNU7-6P\' ' #' IS NOT NULL ' use this for all pair-wise comparison; or Change RNU7-6P for your favorite gene
#labels1 = ' IS NOT NULL '
# parameteres for MicroRNA expression data
feat2 = Features['MicroRNA Expression']
cohort2 = feat2['patientcode'] + " IN ( SELECT case_barcode FROM cohort ) "
labels2 = ' IS NOT NULL '
# significance level
significance_level = 0.001 # only 0.05, 0.01, 0.005, and 0.001 are allowed !!
# minimun correlation
r_treshold = 0.3
# minimun number of samples
nsamples = 25
The cohort for this analysis consists of BRCA patients that 50 years old or younger at the time of diagnosis and Stage IIA as pathological state.
cohort = """
cohort AS(
SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE project_short_name = "TCGA-BRCA" AND age_at_diagnosis <= 50
AND pathologic_stage = 'Stage IIA'
)
"""
The queries to retrieve the necessary data, query_t1 for Gene expression and query_t2 for micro RNA, are generated using a function available in the regulome explorer module available in here https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer/re_module.
query_t1 = regulome.generic_numeric_bqtable ( 'table1' , feat1, cohort1, labels1 )
query_t2 = regulome.generic_numeric_bqtable ( 'table2' , feat2, cohort2, labels2 )
The two tables are integrated into one table to compute correlations. The query to perform these operations is obtained by using a function implemented in the regulome explorer module. 'sql' contains the complete query to perform the correlations. You can copy the query (which is printed ) to the Bigquery - Google cloud console to run the query and obtain the results. The computations can take 10 minutes. Alternatively you can go to "Run the Query" section of this notebook.
Note: isb-cgc-bq.functions.significance_level_ttest2_current
is a persistent function that estimates upper bounds of p-values ( significance levels ), so we can filter out correlations that are not statistically significant ( p-value > significance_level ). This function only works for N < 500. For N>500 we recommend the normal approximation of the t distribution. cgc-05-0042.functions.jstat_ttest
is a persistent function that uses the jstat statistical library to compute pvalues.
summ_query = regulome.get_summarized_table('Gene Expression',feat1,'MicroRNA Expression',feat2)
sql = ( 'WITH' + cohort + ',' + query_t1 + ',' + query_t2 + ',' + summ_query + """
SELECT symbol1, symbol2, n, correlation,
`cgc-05-0042.functions.jstat_ttest`(t, n-2, 2) as p
FROM (
SELECT *,
ABS(correlation)*SQRT( (n-2)/((1-correlation*correlation))) AS t
FROM summ_table
WHERE n > {0} AND ABS(correlation) >= {1} AND ABS(correlation) < 1.0
)
WHERE `isb-cgc-bq.functions.significance_level_ttest2_current`(n-2, t) <= {2}
""".format( str(nsamples), str(r_treshold), str(significance_level) ) )
print( sql ) # This Query performs the computations
WITH cohort AS( SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical` WHERE project_short_name = "TCGA-BRCA" AND age_at_diagnosis <= 50 AND pathologic_stage = 'Stage IIA' ) , table1 AS ( SELECT symbol, data AS rnkdata, ParticipantBarcode FROM ( SELECT gene_name AS symbol, AVG( LOG10( HTSeq__Counts + 1 ) ) AS data, case_barcode AS ParticipantBarcode FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression` WHERE case_barcode IN ( SELECT case_barcode FROM cohort ) # cohort AND gene_name = 'RNU7-6P' # labels AND HTSeq__Counts IS NOT NULL GROUP BY ParticipantBarcode, symbol ) ) , table2 AS ( SELECT symbol, data AS rnkdata, ParticipantBarcode FROM ( SELECT mirna_id AS symbol, AVG( reads_per_million_miRNA_mapped ) AS data, case_barcode AS ParticipantBarcode FROM `isb-cgc.TCGA_hg38_data_v0.miRNAseq_Expression` WHERE case_barcode IN ( SELECT case_barcode FROM cohort ) # cohort AND mirna_id IS NOT NULL # labels AND reads_per_million_miRNA_mapped IS NOT NULL GROUP BY ParticipantBarcode, symbol ) ) , summ_table AS ( SELECT n1.symbol as symbol1, n2.symbol as symbol2, COUNT( n1.ParticipantBarcode ) as n, CORR(n1.rnkdata , n2.rnkdata) as correlation FROM table1 AS n1 INNER JOIN table2 AS n2 ON n1.ParticipantBarcode = n2.ParticipantBarcode GROUP BY symbol1, symbol2 ) SELECT symbol1, symbol2, n, correlation, `cgc-05-0042.functions.jstat_ttest`(t, n-2, 2) as p FROM ( SELECT *, ABS(correlation)*SQRT( (n-2)/((1-correlation*correlation))) AS t FROM summ_table WHERE n > 25 AND ABS(correlation) >= 0.3 AND ABS(correlation) < 1.0 ) WHERE `isb-cgc-bq.functions.significance_level_ttest2_current`(n-2, t) <= 0.001
The following steps run BigQuery and generate a table with significant associations. Note: alpha is an upper bound of the significant level.
bqclient = bigquery.Client()
df_results = regulome.runQuery ( bqclient, sql, [], [], [], dryRun=False )
df_results[0:19]
in runQuery ... this query processed 22617633294 bytes Approx. elpased time : 2204 miliseconds
symbol1 | symbol2 | n | correlation | p | |
---|---|---|---|---|---|
0 | RNU7-6P | hsa-mir-330 | 94 | 0.620962 | 2.575092e-11 |
1 | RNU7-6P | hsa-mir-4731 | 94 | 0.434323 | 1.235507e-05 |
2 | RNU7-6P | hsa-mir-3177 | 94 | 0.437327 | 1.058525e-05 |
3 | RNU7-6P | hsa-mir-4326 | 94 | 0.397826 | 7.224869e-05 |
4 | RNU7-6P | hsa-mir-769 | 94 | 0.365510 | 2.934017e-04 |
5 | RNU7-6P | hsa-mir-7977 | 94 | 0.465289 | 2.334772e-06 |
6 | RNU7-6P | hsa-mir-6862-2 | 94 | 0.610388 | 6.839972e-11 |
7 | RNU7-6P | hsa-mir-1283-2 | 94 | 0.429837 | 1.552192e-05 |
8 | RNU7-6P | hsa-mir-6890 | 94 | 0.949620 | 7.826700e-48 |
9 | RNU7-6P | hsa-mir-1291 | 94 | 0.569515 | 2.173040e-09 |
10 | RNU7-6P | hsa-mir-1915 | 94 | 0.805884 | 1.436690e-22 |
11 | RNU7-6P | hsa-mir-603 | 94 | 0.524834 | 5.773390e-08 |
12 | RNU7-6P | hsa-mir-548at | 94 | 0.370178 | 2.417717e-04 |
13 | RNU7-6P | hsa-mir-6799 | 94 | 0.392695 | 9.113830e-05 |
14 | RNU7-6P | hsa-mir-7641-1 | 94 | 0.752305 | 2.662407e-18 |
15 | RNU7-6P | hsa-mir-4435-2 | 94 | 0.373747 | 2.080951e-04 |
16 | RNU7-6P | hsa-mir-4502 | 94 | 0.497290 | 3.487254e-07 |
17 | RNU7-6P | hsa-mir-4770 | 94 | 0.536321 | 2.598841e-08 |
18 | RNU7-6P | hsa-mir-4764 | 94 | 0.357585 | 4.048466e-04 |
Note that the nonsignificant correlations ( p > significance level ) were filtered out. We can also use Python to compute p values from correlations and number of patients:
if not df_results.empty:
f = lambda n, correlation : (1.0 - stats.t.cdf( abs(correlation) * np.sqrt( (n- 2.0) / (1.0 - correlation**2 )), n-2)) * 2.0
df_results['p-python'] = df_results.apply(lambda x: f(x.n,x.correlation), axis=1)
df_results[0:19]
symbol1 | symbol2 | n | correlation | p | p-python | |
---|---|---|---|---|---|---|
0 | RNU7-6P | hsa-mir-330 | 94 | 0.620962 | 2.575092e-11 | 2.441558e-11 |
1 | RNU7-6P | hsa-mir-4731 | 94 | 0.434323 | 1.235507e-05 | 1.221609e-05 |
2 | RNU7-6P | hsa-mir-3177 | 94 | 0.437327 | 1.058525e-05 | 1.046278e-05 |
3 | RNU7-6P | hsa-mir-4326 | 94 | 0.397826 | 7.224869e-05 | 7.167805e-05 |
4 | RNU7-6P | hsa-mir-769 | 94 | 0.365510 | 2.934017e-04 | 2.917436e-04 |
5 | RNU7-6P | hsa-mir-7977 | 94 | 0.465289 | 2.334772e-06 | 2.299931e-06 |
6 | RNU7-6P | hsa-mir-6862-2 | 94 | 0.610388 | 6.839972e-11 | 6.511769e-11 |
7 | RNU7-6P | hsa-mir-1283-2 | 94 | 0.429837 | 1.552192e-05 | 1.535453e-05 |
8 | RNU7-6P | hsa-mir-6890 | 94 | 0.949620 | 7.826700e-48 | 0.000000e+00 |
9 | RNU7-6P | hsa-mir-1291 | 94 | 0.569515 | 2.173040e-09 | 2.096398e-09 |
10 | RNU7-6P | hsa-mir-1915 | 94 | 0.805884 | 1.436690e-22 | 0.000000e+00 |
11 | RNU7-6P | hsa-mir-603 | 94 | 0.524834 | 5.773390e-08 | 5.630438e-08 |
12 | RNU7-6P | hsa-mir-548at | 94 | 0.370178 | 2.417717e-04 | 2.403356e-04 |
13 | RNU7-6P | hsa-mir-6799 | 94 | 0.392695 | 9.113830e-05 | 9.045473e-05 |
14 | RNU7-6P | hsa-mir-7641-1 | 94 | 0.752305 | 2.662407e-18 | 0.000000e+00 |
15 | RNU7-6P | hsa-mir-4435-2 | 94 | 0.373747 | 2.080951e-04 | 2.068116e-04 |
16 | RNU7-6P | hsa-mir-4502 | 94 | 0.497290 | 3.487254e-07 | 3.418562e-07 |
17 | RNU7-6P | hsa-mir-4770 | 94 | 0.536321 | 2.598841e-08 | 2.528230e-08 |
18 | RNU7-6P | hsa-mir-4764 | 94 | 0.357585 | 4.048466e-04 | 4.027463e-04 |