Regulome Explorer 2.X¶

Comparison of the average gene expression of two groups based on clinical data (catageorical)

Authenticate with Google (IMPORTANT)¶

Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org)

Import Python libraries¶

In [160]:

from google.cloud import bigquery
import numpy as np
import pandas as pd
from scipy import stats

Python convenience functions¶

runQuery: a relatively generic BigQuery query-execution wrapper function which can be used to run a query in "dry-run" mode or not: the call to the query() function itself is inside a try/except block and if it fails we return None; otherwise a "dry" will return an empty dataframe, and a "live" run will return the query results as a dataframe. This function was modify from previous notebooks to handle user-defined parameteres necessary for the purpose of this notbeook.

In [161]:

def runQuery ( client, qString, ParameterList1, ParameterList2, ParameterList3, dryRun=False ):
  
  print ( "\n in runQuery ... " )
  if ( dryRun ):
    print ( "    dry-run only " )
    
  ## set up QueryJobConfig object
  job_config = bigquery.QueryJobConfig()
    
  query_params = [
        bigquery.ArrayQueryParameter('PARAMETERLIST1', 'STRING', ParameterList1 ),
        bigquery.ArrayQueryParameter('PARAMETERLIST2', 'STRING', ParameterList2 ),
        bigquery.ArrayQueryParameter('PARAMETERLIST3', 'STRING', ParameterList3 )
  ]
  job_config.query_parameters = query_params  
    
  job_config.dry_run = dryRun
  job_config.use_query_cache = True
  job_config.use_legacy_sql = False
  
  ## run the query
  try:
    query_job = client.query ( qString, job_config=job_config )
    ## print ( "    query job state: ", query_job.state )
  except:
    print ( "  FATAL ERROR: query execution failed " )
    return ( None )
  
  ## return results as a dataframe (or an empty dataframe for a dry-run) 
  if ( not dryRun ):
    try:
      df = query_job.to_dataframe()
      if ( query_job.total_bytes_processed==0 ):
        print ( "    the results for this query were previously cached " )
      else:
        print ( "    this query processed {} bytes ".format(query_job.total_bytes_processed) )
      if ( len(df) < 1 ):
        print ( "  WARNING: this query returned NO results ")
      return ( df )
    except:
      print ( "  FATAL ERROR: query execution failed " )
      return ( None )
    
  else:
    print ( "    if not cached, this query will process {} bytes ".format(query_job.total_bytes_processed) )
    ## return an empty dataframe
    return ( pd.DataFrame() )

SQL Building Functions¶

build_expdata. This function outputs a query string to create a table with three columns: patient ID, gene expresion, and gene name, for a user defined list of genes (PARAMETERLIST1) or all genes with gene expression data. Since a patient can have multiple samples, the gene expression value for each patient is averaged over the respective samples.

In [162]:

def build_expdata ( study ):
  qString = """
  expdata AS (
       SELECT      
           participantBarcode as patientid,
           Symbol,
           AVG( LOG10( normalized_count +1 )  ) as gene_exp
       FROM
          `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered` 
       WHERE
           #study = '__study__' AND Symbol IN UNNEST(@PARAMETERLIST1)
           study = '__study__' AND Symbol IS NOT NULL
       GROUP BY ParticipantBarcode, Symbol
  )""".replace('__study__',study)
  return qString
    

build_statdata. This function outputs a query string to create a table with four columns: gene_name, category, n, sumx, and sumx2. n is the number of patients with the same clinical category, sumx (sumx2) is the sum of the gene expression (squared) data over all patients with the same category.

In [163]:

def build_statdata( category ):
  qString ="""
  statdata AS (
       SELECT
           A.Symbol,
           B.__category__ as category,
           COUNT( A.patientid ) as n,
           SUM( A.gene_exp ) as sumx,
           SUM( A.gene_exp * A.gene_exp ) as sumx2
       FROM
           expdata A
       INNER JOIN `pancancer-atlas.Filtered.clinical_PANCAN_patient_with_followup_filtered`  B
           ON A.patientid = B.bcr_patient_barcode
       GROUP BY
           B.__category__, A.Symbol     
  )""".replace('__category__', category)    
  return qString  

build_groupdate. Outputs a query string that summarize the statistical data for the desired catageory values that are specified for group1 and group 2. This catagory values are defined by a couple parameter lists, PARAMETERLIST2 and PARAMETERLIST3 for group 1 and group 2 respectively.

In [164]:

def build_groupdate():
  qString ="""
  group1 AS (
       SELECT Symbol, SUM( n ) as n, SUM( sumx ) as sumx, SUM( sumx2 ) as sumx2
       FROM statdata 
       WHERE category IN UNNEST(@PARAMETERLIST2)
       GROUP BY Symbol           
  ),
  group2 AS (
       SELECT Symbol, SUM( n ) as n, SUM( sumx ) as sumx, SUM( sumx2 ) as sumx2
       FROM statdata 
       WHERE category IN UNNEST(@PARAMETERLIST3)
       GROUP BY Symbol           
  )
  """  
  return qString

Start the analysis¶

In [165]:

# Start by getting authorized.
bqclient = bigquery.Client()

Specify Parameters¶

In [166]:

study = 'ACC' # Select Tumor type 
ClinicFeature1 = 'pathologic_stage'  # Clinical feature 
#gene_list =['IGF2','ADAM6','GRWD1','PATE3','CASP8'] #gene nems, irrelevant if all genes are evaluated
group1 = [ 'Stage I', 'Stage II']   # Category values to define group 1
group2 = ['Stage III', 'Stage IV']  # Category values to define group 1

Obtain data from Bigquery¶

We retreive data for t test evaluation. The ouput of the code is a dataframe with 7 columns corresponding to gene name (Symbol), the number of samples, average expression and variance of the two groups. This computation can take several seconds.

In [167]:

# Build the sql code
sql = (
"""WITH
"""+
build_expdata( study ) + ',\n' + 
build_statdata( ClinicFeature1 ) + ',\n' +
build_groupdate() + '\n' +    
"""
SELECT 
  A.Symbol,
  A.n as n1, (A.sumx/A.n) as avg1 , (A.sumx2 - A.sumx*A.sumx/A.n)/(A.n - 1) as var1 ,
  B.n as n2, (B.sumx/B.n) as avg2 , (B.sumx2 - B.sumx*B.sumx/B.n)/(B.n - 1) as var2
FROM 
  group1 A
INNER JOIN
  group2 B
ON   
  A.Symbol  = B.Symbol 
"""
)

# bigquery 
res0 = runQuery ( bqclient, sql, gene_list, group1, group2, dryRun=False )

 in runQuery ... 
    the results for this query were previously cached

We can now take a look the computed statistics. We have to make sure that there is not NUL values which may indicate that a group has no elements.

In [168]:

res0[0:10]

Out[168]:

	Symbol	n1	avg1	var1	n2	avg2	var2
0	TNIP2	46	2.721017	0.032786	30	2.724020	0.023377
1	AP1G1	46	3.370978	0.014340	30	3.346823	0.029890
2	C14orf178	46	0.162027	0.030331	30	0.176604	0.034447
3	UBASH3B	46	1.566646	0.122094	30	1.533723	0.125481
4	PFKFB4	46	2.533325	0.232271	30	2.699664	0.241280
5	POU3F2	46	0.649067	0.366423	30	0.808881	0.471779
6	LOC100132831	46	0.101804	0.019569	30	0.121566	0.025672
7	NKAP	46	2.806441	0.020916	30	2.789437	0.032605
8	C6orf162	46	2.164080	0.031893	30	2.145919	0.029427
9	CXCL10	46	1.336228	0.318461	30	1.342521	0.510212

Analyze the results¶

We can now use python function to the data in the resulting dataframe (res0). For example we can compute t-scores and p-values to determine if the difference between the mean of the two groups is statistically significant

In [169]:

# t statistics
f = lambda n1,n2,avg1,avg2,var1,var2  : (avg1 - avg2) / np.sqrt(var1/float(n1) + var2/float(n2) ) 
res0['tscore'] = res0.apply(lambda x: f(x.n1,x.n2,x.avg1,x.avg2,x.var1,x.var2), axis=1)

# p value 
f = lambda  n1,n2,tscore :  (1.0 -  stats.t.cdf(abs(tscore), n1 + n2 - 2 )) * 2.0 
res0['pvalue'] = res0.apply(lambda x: f(x.n1,x.n2,x.tscore), axis=1)

res0.sort_values( by= 'pvalue')[0:10]

/Users/borisaguilar/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars

Out[169]:

	Symbol	n1	avg1	var1	n2	avg2	var2	tscore	pvalue
6740	DCAF15	46	2.791652	0.021481	30	3.001801	0.021450	-6.112562	4.211413e-08
2810	C14orf80	46	1.972242	0.065065	30	2.327349	0.060538	-6.061233	5.213308e-08
5810	HAUS8	46	2.022837	0.103465	30	2.398931	0.062825	-5.706652	2.243011e-07
11548	CENPJ	46	1.993653	0.035947	30	2.218451	0.026015	-5.536444	4.472172e-07
9086	DPF1	46	0.717971	0.338007	30	1.341653	0.161269	-5.529146	4.605700e-07
17069	MCM10	46	1.390464	0.201234	30	1.940207	0.175986	-5.432395	6.793837e-07
13006	PTCH2	46	1.237971	0.119995	30	1.734555	0.181423	-5.337440	9.924677e-07
17115	POLD1	46	2.596990	0.053311	30	2.857353	0.037465	-5.306079	1.124178e-06
11638	TRIP13	46	1.905145	0.174519	30	2.404135	0.152228	-5.298768	1.157266e-06
1901	UBE2S	46	2.703592	0.117995	30	3.069638	0.069413	-5.240540	1.457236e-06

We can plot the data

In [170]:

res0['tscore'].plot.kde()

Out[170]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a256c1908>

In [ ]: