All pairwise associations between Gene Expression and MicroRNA data in TCGA

Check out more notebooks at our 'Regulome Explorer Repository'!

Title:   All pairwise associations between Gene Expression and MicroRNA data in TCGA
Author:  Boris Aguilar
Created: 03-11-2020
Purpose: Demonstrate how to use the Regulome explorer functionality to find significant associations from Gene expression and microRNA BigQuery tables
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Correlation-GeneExpression-MicroRNA

In this notebook, we use BigQuery to find significant correlations between all possible pairs of genes in the RNAseq data and unique identifiers in the microRNA data available in TCGA. The cohort for this analysis consists of BRCA patients that are 50 years old or younger at the time of diagnosis and Stage IIA as pathological state.

We used Bigquery to compute Pearson correlations and identify significant correlation coefficients using a BigQuery user-defined function. The necessary queries are generated using Python functions implemented in the Regulome Explorer module.

Authentication

The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

In [1]:
%matplotlib inline
from google.cloud import bigquery
from scipy import stats
import re_module.bq_functions as regulome
import numpy as np

Parameters

In this experiment we will use Gene Expression data and microRNA data available in ISB-CGC. The information about the Bigquery tables and their columns are stored in the Features dictionary. The significance level and the minimum number of patients for the correlation analysis are free parameters that can be changed.

If you want to test this notebook checking all possible pairs please replace "labels1 = ..." with "labels1 = ' IS NOT NULL '" but know the computations will take several minutes to complete; we recommend to generate the query and copy it in the console.

In [2]:
# information from Bigquery tables
Features = {     'Gene Expression' : { 'table'  : 'isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression',
                                       'symbol' : 'gene_name',
                                       'study'  : 'project_short_name',
                                       'data'   : 'AVG( LOG10( HTSeq__Counts + 1 ) ) ',
                                       'rnkdata': 'data',
                                       'avgdat' : 'avgdata',  
                                   'patientcode': 'case_barcode',
                                    'samplecode': 'sample_barcode',
                                       'where'  : 'AND HTSeq__Counts IS NOT NULL',
                                       'dattype': 'numeric' },
             'MicroRNA Expression': {  'table'  : 'isb-cgc.TCGA_hg38_data_v0.miRNAseq_Expression',
                                       'symbol' : 'mirna_id',
                                        'study' : 'project_short_name',
                                       'data'   : 'AVG( reads_per_million_miRNA_mapped )',
                                       'rnkdata': 'data',
                                       'avgdat' : 'avgdata',
                                   'patientcode': 'case_barcode',
                                    'samplecode': 'sample_barcode',
                                       'where'  : 'AND reads_per_million_miRNA_mapped IS NOT NULL',
                                       'dattype': 'numeric'}
                
               }

# parameteres for Gene expression data
feat1 = Features['Gene Expression']
cohort1 = feat1['patientcode'] + " IN ( SELECT case_barcode FROM cohort ) "
labels1 = ' =  \'RNU7-6P\' '  #' IS NOT NULL ' use this for all pair-wise comparison; or Change RNU7-6P for your favorite gene 
#labels1 = ' IS NOT NULL '

# parameteres for MicroRNA expression data
feat2 = Features['MicroRNA Expression'] 
cohort2 = feat2['patientcode'] + " IN ( SELECT case_barcode FROM cohort ) "
labels2 = ' IS NOT NULL '

# significance level
significance_level = 0.001 # only 0.05, 0.01, 0.005, and 0.001 are allowed !!

# minimun correlation
r_treshold = 0.3

# minimun number of samples
nsamples = 25

Read cohort

The cohort for this analysis consists of BRCA patients that 50 years old or younger at the time of diagnosis and Stage IIA as pathological state.

In [3]:
cohort = """
cohort AS(
SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE project_short_name = "TCGA-BRCA" AND age_at_diagnosis <= 50
      AND pathologic_stage = 'Stage IIA'
)
"""

Queries to read tables

The queries to retrieve the necessary data, query_t1 for Gene expression and query_t2 for micro RNA, are generated using a function available in the regulome explorer module available in here https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer/re_module.

In [4]:
query_t1 = regulome.generic_numeric_bqtable ( 'table1' , feat1, cohort1, labels1 )
query_t2 = regulome.generic_numeric_bqtable ( 'table2' , feat2, cohort2, labels2 ) 

Queries to join tables and perform statistics

The two tables are integrated into one table to compute correlations. The query to perform these operations is obtained by using a function implemented in the regulome explorer module. 'sql' contains the complete query to perform the correlations. You can copy the query (which is printed ) to the Bigquery - Google cloud console to run the query and obtain the results. The computations can take 10 minutes. Alternatively you can go to "Run the Query" section of this notebook.

Note: isb-cgc-bq.functions.significance_level_ttest2_current is a persistent function that estimates upper bounds of p-values ( significance levels ), so we can filter out correlations that are not statistically significant ( p-value > significance_level ). This function only works for N < 500. For N>500 we recommend the normal approximation of the t distribution. cgc-05-0042.functions.jstat_ttest is a persistent function that uses the jstat statistical library to compute pvalues.

In [5]:
summ_query = regulome.get_summarized_table('Gene Expression',feat1,'MicroRNA Expression',feat2)

sql = ( 'WITH' + cohort + ',' + query_t1 + ',' + query_t2 + ',' + summ_query + """
SELECT symbol1, symbol2, n, correlation,
   `cgc-05-0042.functions.jstat_ttest`(t, n-2, 2) as p
FROM (
   SELECT *, 
         ABS(correlation)*SQRT( (n-2)/((1-correlation*correlation))) AS t
   FROM summ_table
   WHERE n > {0} AND ABS(correlation) >= {1} AND ABS(correlation) < 1.0
     )
WHERE  `isb-cgc-bq.functions.significance_level_ttest2_current`(n-2, t) <= {2}
""".format( str(nsamples), str(r_treshold), str(significance_level) ) )

print( sql )  # This Query performs the computations
WITH
cohort AS(
SELECT case_barcode FROM `isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE project_short_name = "TCGA-BRCA" AND age_at_diagnosis <= 50
      AND pathologic_stage = 'Stage IIA'
)
,
table1 AS (
SELECT
   symbol,
   data AS rnkdata,
   ParticipantBarcode
FROM (
   SELECT
      gene_name AS symbol, 
      AVG( LOG10( HTSeq__Counts + 1 ) )  AS data,
      case_barcode AS ParticipantBarcode
   FROM `isb-cgc.TCGA_hg38_data_v0.RNAseq_Gene_Expression`
   WHERE case_barcode IN ( SELECT case_barcode FROM cohort )     # cohort 
         AND gene_name  =  'RNU7-6P'   # labels 
         AND HTSeq__Counts IS NOT NULL  
   GROUP BY
      ParticipantBarcode, symbol
   )
)
,
table2 AS (
SELECT
   symbol,
   data AS rnkdata,
   ParticipantBarcode
FROM (
   SELECT
      mirna_id AS symbol, 
      AVG( reads_per_million_miRNA_mapped ) AS data,
      case_barcode AS ParticipantBarcode
   FROM `isb-cgc.TCGA_hg38_data_v0.miRNAseq_Expression`
   WHERE case_barcode IN ( SELECT case_barcode FROM cohort )     # cohort 
         AND mirna_id  IS NOT NULL   # labels 
         AND reads_per_million_miRNA_mapped IS NOT NULL  
   GROUP BY
      ParticipantBarcode, symbol
   )
)
,
summ_table AS (
SELECT 
   n1.symbol as symbol1,
   n2.symbol as symbol2,
   COUNT( n1.ParticipantBarcode ) as n,
   CORR(n1.rnkdata , n2.rnkdata) as correlation
    
FROM
   table1 AS n1
INNER JOIN
   table2 AS n2
ON
   n1.ParticipantBarcode = n2.ParticipantBarcode
   
GROUP BY
   symbol1, symbol2
)
SELECT symbol1, symbol2, n, correlation,
   `cgc-05-0042.functions.jstat_ttest`(t, n-2, 2) as p
FROM (
   SELECT *, 
         ABS(correlation)*SQRT( (n-2)/((1-correlation*correlation))) AS t
   FROM summ_table
   WHERE n > 25 AND ABS(correlation) >= 0.3 AND ABS(correlation) < 1.0
     )
WHERE  `isb-cgc-bq.functions.significance_level_ttest2_current`(n-2, t) <= 0.001

Run the Query

The following steps run BigQuery and generate a table with significant associations. Note: alpha is an upper bound of the significant level.

In [6]:
bqclient = bigquery.Client()
df_results = regulome.runQuery ( bqclient, sql, [], [], [], dryRun=False )
df_results[0:19]
 in runQuery ... 
    this query processed 22617633294 bytes 
    Approx. elpased time : 2204 miliseconds 
Out[6]:
symbol1 symbol2 n correlation p
0 RNU7-6P hsa-mir-330 94 0.620962 2.575092e-11
1 RNU7-6P hsa-mir-4731 94 0.434323 1.235507e-05
2 RNU7-6P hsa-mir-3177 94 0.437327 1.058525e-05
3 RNU7-6P hsa-mir-4326 94 0.397826 7.224869e-05
4 RNU7-6P hsa-mir-769 94 0.365510 2.934017e-04
5 RNU7-6P hsa-mir-7977 94 0.465289 2.334772e-06
6 RNU7-6P hsa-mir-6862-2 94 0.610388 6.839972e-11
7 RNU7-6P hsa-mir-1283-2 94 0.429837 1.552192e-05
8 RNU7-6P hsa-mir-6890 94 0.949620 7.826700e-48
9 RNU7-6P hsa-mir-1291 94 0.569515 2.173040e-09
10 RNU7-6P hsa-mir-1915 94 0.805884 1.436690e-22
11 RNU7-6P hsa-mir-603 94 0.524834 5.773390e-08
12 RNU7-6P hsa-mir-548at 94 0.370178 2.417717e-04
13 RNU7-6P hsa-mir-6799 94 0.392695 9.113830e-05
14 RNU7-6P hsa-mir-7641-1 94 0.752305 2.662407e-18
15 RNU7-6P hsa-mir-4435-2 94 0.373747 2.080951e-04
16 RNU7-6P hsa-mir-4502 94 0.497290 3.487254e-07
17 RNU7-6P hsa-mir-4770 94 0.536321 2.598841e-08
18 RNU7-6P hsa-mir-4764 94 0.357585 4.048466e-04

Note that the nonsignificant correlations ( p > significance level ) were filtered out. We can also use Python to compute p values from correlations and number of patients:

In [7]:
if not df_results.empty:
    f = lambda n, correlation  : (1.0 - stats.t.cdf( abs(correlation) * np.sqrt( (n- 2.0) / (1.0 - correlation**2 )), n-2)) * 2.0  
    df_results['p-python'] = df_results.apply(lambda x: f(x.n,x.correlation), axis=1)

df_results[0:19]
Out[7]:
symbol1 symbol2 n correlation p p-python
0 RNU7-6P hsa-mir-330 94 0.620962 2.575092e-11 2.441558e-11
1 RNU7-6P hsa-mir-4731 94 0.434323 1.235507e-05 1.221609e-05
2 RNU7-6P hsa-mir-3177 94 0.437327 1.058525e-05 1.046278e-05
3 RNU7-6P hsa-mir-4326 94 0.397826 7.224869e-05 7.167805e-05
4 RNU7-6P hsa-mir-769 94 0.365510 2.934017e-04 2.917436e-04
5 RNU7-6P hsa-mir-7977 94 0.465289 2.334772e-06 2.299931e-06
6 RNU7-6P hsa-mir-6862-2 94 0.610388 6.839972e-11 6.511769e-11
7 RNU7-6P hsa-mir-1283-2 94 0.429837 1.552192e-05 1.535453e-05
8 RNU7-6P hsa-mir-6890 94 0.949620 7.826700e-48 0.000000e+00
9 RNU7-6P hsa-mir-1291 94 0.569515 2.173040e-09 2.096398e-09
10 RNU7-6P hsa-mir-1915 94 0.805884 1.436690e-22 0.000000e+00
11 RNU7-6P hsa-mir-603 94 0.524834 5.773390e-08 5.630438e-08
12 RNU7-6P hsa-mir-548at 94 0.370178 2.417717e-04 2.403356e-04
13 RNU7-6P hsa-mir-6799 94 0.392695 9.113830e-05 9.045473e-05
14 RNU7-6P hsa-mir-7641-1 94 0.752305 2.662407e-18 0.000000e+00
15 RNU7-6P hsa-mir-4435-2 94 0.373747 2.080951e-04 2.068116e-04
16 RNU7-6P hsa-mir-4502 94 0.497290 3.487254e-07 3.418562e-07
17 RNU7-6P hsa-mir-4770 94 0.536321 2.598841e-08 2.528230e-08
18 RNU7-6P hsa-mir-4764 94 0.357585 4.048466e-04 4.027463e-04
In [ ]: