Example notebook exploring CPTAC protein abundances

Check out more notebooks at our Community Notebooks Repository!

Title:   Example notebook exploring CPTAC protein abundances
Author:  Boris Aguilar
Created: 01-19-2021
Purpose: Retrieve and analyze protein abundances from CPTAC
Notes:   This notebook recapitulates the following notebook https://pdc.cancer.gov/API_documentation/PDC_clustergram.html

This notebook extracts protein abundances from the CPTAC clear cell renal cell carcinoma (CCRCC) discovery study, together with the associated clinical metadata, from the publicly available BigQuery tables that the ISB-CGC project has produced from CPTAC data. It then clusters and visualizes the data using the Seaborn clustermap function.

Modules

In [56]:
from google.cloud import bigquery
from google.colab import auth
import seaborn as se
import pandas as pd
import pandas_gbq
import matplotlib.pyplot as plt
from scipy.stats import zscore

Defining helper functions

In [48]:
# A color mapping function for the clinical annotations:
# each unique value in column `name` gets its own shade of `color`.
def get_colors(df, name, color) -> pd.Series:
    s = pd.Series(df[name])
    su = s.unique()
    colors = se.light_palette(color, len(su))
    lut = dict(zip(su, colors))  # lookup table: annotation value -> color
    return s.map(lut)

Google Authentication

The first step is to authorize access to BigQuery and Google Cloud. For more information, see 'Quick Start Guide to ISB-CGC'; alternative authentication methods can be found here.

You also need a Google Cloud project to be able to run BigQuery queries.

In [18]:
auth.authenticate_user()
my_project_id = "" # write your project id here

Fetch the data

The following code obtains protein abundances and clinical metadata for all the cases in the CPTAC CCRCC study. Specifically, we join two tables, quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current and clinical_CPTAC3_discovery_pdc_current, which host the protein abundances and the clinical metadata, respectively.

The result of the query is automatically stored in a pandas dataframe (quant_data) by the function read_gbq.

In [19]:
sql = '''
SELECT pg.aliquot_submitter_id, pg.gene_symbol, 
       CAST(pg.protein_abundance_log2ratio as FLOAT64) as log2ratio,
       clin.tumor_stage, clin.primary_diagnosis 
FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as pg
JOIN `isb-cgc-bq.CPTAC.clinical_CPTAC3_discovery_pdc_current` as clin
ON pg.case_id = clin.case_id
'''
quant_data = pandas_gbq.read_gbq(sql,project_id=my_project_id )
quant_data
Downloading: 100%|██████████| 1985337/1985337 [01:38<00:00, 20156.62rows/s]
Out[19]:
aliquot_submitter_id gene_symbol log2ratio tumor_stage primary_diagnosis
0 NCI7-2 COX1 -0.2728 None None
1 QC5 COX1 -0.7462 None None
2 QC4 COX1 -0.8352 None None
3 QC7 COX1 -0.6299 None None
4 QC8 COX1 -0.9439 None None
... ... ... ... ... ...
1985332 CPT0024680001 LZTR1 0.0672 Stage III Renal cell carcinoma, NOS
1985333 CPT0066430001 LZTR1 -0.2127 Stage III Renal cell carcinoma, NOS
1985334 CPT0009060003 LZTR1 -0.1471 Stage III Renal cell carcinoma, NOS
1985335 CPT0006730001 LZTR1 -0.0764 Stage III Renal cell carcinoma, NOS
1985336 CPT0024670003 LZTR1 -0.2035 Stage III Renal cell carcinoma, NOS

1985337 rows × 5 columns
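
To make the join concrete, here is a minimal pandas sketch of what the SQL above does, using two tiny made-up tables in place of the BigQuery ones. All values and the toy_ names are hypothetical; only the column names follow the query. The third clinical row mimics the QC samples in the output above, which have a clinical record but no tumor stage.

```python
import pandas as pd

# Toy stand-in for the protein abundance table (values are made up).
toy_quant = pd.DataFrame({
    "case_id": ["c1", "c1", "c2", "c3"],
    "aliquot_submitter_id": ["A1", "A1", "A2", "QC1"],
    "gene_symbol": ["COX1", "LZTR1", "COX1", "COX1"],
    "log2ratio": [0.10, -0.20, 0.30, -0.70],
})

# Toy stand-in for the clinical table; c3 has no tumor stage recorded.
toy_clinical = pd.DataFrame({
    "case_id": ["c1", "c2", "c3"],
    "tumor_stage": ["Stage I", "Stage III", None],
})

# Inner join on case_id, the pandas analogue of
# JOIN ... ON pg.case_id = clin.case_id
toy_joined = toy_quant.merge(toy_clinical, on="case_id", how="inner")
print(toy_joined)
```

Every abundance row picks up the clinical fields of its case, so the clinical values are repeated once per gene and aliquot, as in quant_data.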

Analysis

The clustermap function in the Seaborn package does not accept NaN values, so we replace them with a mask value that is close enough to zero not to interfere much with the clustering and is unlikely to occur in the real data.

In [21]:
mask_na = 0.000666
quant_data = quant_data.fillna(mask_na)
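
As a small illustration of the sentinel idea, the sketch below fills the missing entries of a toy matrix and then recovers their positions as a boolean mask by comparing against the same sentinel. The toy_ names and values are made up for this example.

```python
import numpy as np
import pandas as pd

toy_mask_na = 0.000666  # same sentinel idea as mask_na above

# Toy abundance matrix with one missing measurement.
toy = pd.DataFrame(
    {"A1": [0.5, np.nan], "A2": [-1.2, 0.3]},
    index=["GENE1", "GENE2"],
)

# Replace NaN with the sentinel so that clustering can run.
toy_filled = toy.fillna(toy_mask_na)

# Boolean mask that is True exactly where the sentinel was written.
toy_mask = toy_filled == toy_mask_na
print(toy_mask)
```

A mask of this form is what the clustermap calls below pass through the mask argument to hide the filled cells. The approach relies on the sentinel never appearing as a genuine measurement.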

We then need to pivot the data. This step transforms the data from the tidy (long) format used in BigQuery to the wide format required by the clustermap function (aliquot_submitter_id as columns and gene_symbol as rows).

In [14]:
ga = pd.pivot_table(quant_data,values='log2ratio', 
                               index='gene_symbol', 
                               columns='aliquot_submitter_id')
print(ga.shape)
(9591, 207)
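
The pivot step can be seen on a toy tidy table, sketched below with made-up values. Note that pd.pivot_table aggregates with the mean by default, so repeated (gene_symbol, aliquot_submitter_id) pairs are averaged into a single cell.

```python
import pandas as pd

# Toy tidy table: one row per (aliquot, gene) measurement.
# The pair (A1, GENE1) appears twice on purpose.
toy_tidy = pd.DataFrame({
    "aliquot_submitter_id": ["A1", "A1", "A1", "A2", "A2"],
    "gene_symbol": ["GENE1", "GENE1", "GENE2", "GENE1", "GENE2"],
    "log2ratio": [0.5, 1.5, -0.1, -1.2, 0.3],
})

# Genes become rows and aliquots become columns;
# duplicates are averaged (default aggfunc is the mean).
toy_wide = pd.pivot_table(
    toy_tidy,
    values="log2ratio",
    index="gene_symbol",
    columns="aliquot_submitter_id",
)
print(toy_wide)
```

Here the two GENE1 measurements for A1 (0.5 and 1.5) collapse into one cell holding their mean.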

Next, we set up colors for the clinical features tumor_stage and primary_diagnosis.

In [54]:
labels = quant_data[['aliquot_submitter_id','tumor_stage','primary_diagnosis']].drop_duplicates().set_index('aliquot_submitter_id')
stage_col_colors = get_colors(labels, 'tumor_stage', 'red')
diagnosis_col_colors = get_colors(labels, 'primary_diagnosis', 'green')

#combine the two series into a dataframe
color_bars = pd.concat([stage_col_colors, diagnosis_col_colors], axis=1)
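
The lookup-table mapping inside get_colors can be sketched without Seaborn. In the example below, a hard-coded list of hex colors (hypothetical, chosen only for illustration) stands in for the palette returned by se.light_palette, and the toy_ names are made up.

```python
import pandas as pd

# Toy label table indexed by aliquot, like `labels` above.
toy_labels = pd.DataFrame(
    {"tumor_stage": ["Stage I", "Stage III", "Stage I"]},
    index=pd.Index(["A1", "A2", "A3"], name="aliquot_submitter_id"),
)

# Hypothetical palette: one color per unique annotation value.
toy_palette = ["#fde0dd", "#c51b8a"]

# Lookup table from annotation value to color, then map each aliquot.
toy_lut = dict(zip(toy_labels["tumor_stage"].unique(), toy_palette))
toy_colors = toy_labels["tumor_stage"].map(toy_lut)
print(toy_colors)
```

The result keeps the aliquot index, so aliquots with the same annotation share a color and each color can be matched to its aliquot by label.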

Finally, we generate the clustergram.

In [53]:
se.clustermap(ga, metric='euclidean', method='complete', cmap='seismic',
              mask=(ga == mask_na), center=0.,
              figsize=(10, 10), col_colors=color_bars)
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

The clustergram shows the protein abundance data clustered by aliquot_submitter_id (columns) and gene_symbol (rows).

You could also convert the log2 ratios to a standard statistic, such as a z-score computed for each gene across aliquots. This puts the genes on a comparable scale and can compress the range of values, reducing the influence of outliers.

In [57]:
zdf = ga.T.apply(zscore, ddof=len(ga.columns)-1)
zdf = zdf.T
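
One detail of the cell above is worth checking: scipy.stats.zscore divides by a standard deviation computed with divisor N - ddof, so ddof=len(ga.columns)-1 leaves a divisor of 1. The sketch below, on a made-up toy matrix, compares that choice with the conventional sample z-score (ddof=1). When every gene has the same number of aliquots N, the two differ only by the constant factor sqrt(N-1), so the clustering pattern should be the same and only the color scale changes.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy genes x aliquots matrix (values are made up).
toy_ga = pd.DataFrame(
    {"A1": [1.0, 0.0], "A2": [2.0, 0.0], "A3": [3.0, 6.0]},
    index=["GENE1", "GENE2"],
)
n = toy_ga.shape[1]  # number of aliquots

# Same call pattern as the notebook: ddof = N - 1, so the divisor is 1.
z_notebook = toy_ga.T.apply(zscore, ddof=n - 1).T

# Conventional sample z-score: ddof = 1, so the divisor is N - 1.
z_sample = toy_ga.T.apply(zscore, ddof=1).T

# The two results differ by the constant factor sqrt(N - 1).
print(np.allclose(z_sample.values, z_notebook.values * np.sqrt(n - 1)))
```

If z-scores in conventional units are wanted, for example to read the color bar as standard deviations, ddof=1 (or the default ddof=0) would be the usual choice.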

And examine clustering according to that transformation.

In [58]:
se.clustermap(zdf, metric='euclidean', method='complete', cmap='seismic',
              mask=(ga == mask_na), center=0.,
              figsize=(10, 10), col_colors=color_bars)
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

This clustergram groups the data more clearly: a cluster of cell lines can be distinguished on the left-hand side. Please compare this figure with the one generated by the PDC notebook at https://pdc.cancer.gov/API_documentation/PDC_clustergram.html