
Comparing protein expression from different pipelines - CPTAC

Check out more notebooks at our Community Notebooks Repository!

Title:   Comparing protein expression from different pipelines 
Author:  Boris Aguilar
Created: 05-23-2021
Purpose: Compare proteomic expression from PDC and other pipelines available in the cptac library (https://github.com/PayneLab/cptac)
Notes: Runs in Google Colab

This notebook uses BigQuery to compare protein expression from the PDC with protein expression from other pipelines. We used the cptac library to obtain protein expression derived from pipelines other than the one used by the PDC.

Modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery
from google.colab import auth
import pandas_gbq

Google Authentication

The first step is to authorize access to BigQuery and Google Cloud. For more information, see the 'Quick Start Guide to ISB-CGC'. Alternative authentication methods are also available.

You also need a Google Cloud project to run BigQuery queries.

In [2]:
auth.authenticate_user()
my_project_id = "" # write your project id here
bqclient = bigquery.Client( my_project_id )

Install cptac library

In [3]:
try:
    import cptac
except ImportError:
    !pip install cptac --quiet
    import cptac
import cptac.utils as ut

Use the cptac library to download proteomic data for lung adenocarcinoma (LUAD) and load it into a pandas dataframe.

In [4]:
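cptac.download(dataset="Luad", version="latest")  # fetch the LUAD dataset files
luad = cptac.Luad()                               # load the LUAD dataset object
df = luad.get_proteomics()                        # patients in rows, proteins in columns
df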

Out[4]:
Name         A1BG         ...  ZZEF1        ZZZ3
Database_ID  NP_570602.2  ...  NP_055928.3  NP_056349.1|NP_001295166.1
Patient_ID
C3L-00001    -2.5347      ...  -1.1840      -2.5284
C3L-00009    -0.5627      ...   0.5066       0.4311
C3L-00080    -1.9422      ...   1.2872      -0.7301
C3L-00083     2.1636      ...   0.6547       NaN
C3L-00093    -1.0022      ...  -0.4991      -0.3077
...           ...         ...   ...          ...
C3N-02582.N   1.8277      ...   0.7770       0.9497
C3N-02586.N   0.8035      ...   1.1886       1.1807
C3N-02587.N   1.7637      ...   0.7083       1.1825
C3N-02588.N   1.0875      ...   0.6358       1.2729
C3N-02729.N   2.6011      ...   0.6202       0.8390

211 rows × 10699 columns

From dataframe to BigQuery table

The following commands transform the dataframe to a tidy format and save it into a BigQuery table in your project.

In [5]:
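# Reshape from wide (one column per gene) to tidy (one row per patient-gene pair)
tdf = pd.melt(df, var_name="gene_name", value_name="protein_abundance", ignore_index=False)
tdf.reset_index(inplace=True)  # move Patient_ID from the index into a column
tdf[0:10]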
Out[5]:
Patient_ID gene_name protein_abundance
0 C3L-00001 A1BG -2.5347
1 C3L-00009 A1BG -0.5627
2 C3L-00080 A1BG -1.9422
3 C3L-00083 A1BG 2.1636
4 C3L-00093 A1BG -1.0022
5 C3L-00094 A1BG -1.5576
6 C3L-00095 A1BG -1.0718
7 C3L-00140 A1BG -1.0799
8 C3L-00144 A1BG -1.9159
9 C3L-00263 A1BG -1.1384
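
To make the reshaping step concrete, the sketch below applies the same `pd.melt` call to a toy two-patient, two-gene frame. The toy frame uses plain (single-level) column names for simplicity, and the variable names `wide` and `tidy` are illustrative only; the values are copied from the first rows of the output above.

```python
import pandas as pd

# Toy wide table: one row per patient, one column per gene (illustrative only).
wide = pd.DataFrame(
    {"A1BG": [-2.5347, -0.5627], "A2M": [-3.4057, -1.7945]},
    index=pd.Index(["C3L-00001", "C3L-00009"], name="Patient_ID"),
)

# ignore_index=False keeps Patient_ID as the index while melting,
# and reset_index() then turns it into an ordinary column.
tidy = pd.melt(
    wide,
    var_name="gene_name",
    value_name="protein_abundance",
    ignore_index=False,
).reset_index()

print(tidy)
```

Each patient-gene pair becomes one row, which is the shape a BigQuery table needs in order to be joined on patient and gene.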

The following commands send the protein expression data to a BigQuery table.

In [6]:
table_id = 'test_dataset2.luad_cptac_paynelab' # test_dataset2 is dataset and luad_cptac_paynelab is the table name
pandas_gbq.to_gbq(tdf, table_id, project_id=my_project_id)

Compute Pearson correlations in BigQuery

Here we compare protein expression values from the cptac library with those generated from PDC proteomics data. For each gene, we compute the Pearson correlation between the two sets of values across patients.

The first step is to build a query to retrieve PDC based protein expressions, which are available in BigQuery tables in the public project isb-cgc-bq.

In [7]:
pdc = '''
WITH pdc AS (
    SELECT meta.case_submitter_id, quant.gene_symbol, 
           CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio
    FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_LUAD_discovery_study_pdc_current` as quant
    JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta
        ON quant.case_id = meta.case_id
        AND quant.aliquot_id = meta.aliquot_id
        AND meta.sample_type = 'Primary Tumor'
)
'''
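
The string above only opens a WITH clause, so it is not a complete statement by itself; the following cells append a second common table expression (CTE) and a final SELECT. As a rough sketch of how such pieces fit together, the hypothetical helper `compose_query` below (not part of this notebook or of any library) joins named CTE bodies into a single statement. The SQL fragments are placeholders, not the real queries.

```python
def compose_query(ctes, final_select):
    """Build 'WITH a AS (...), b AS (...) SELECT ...' from named CTE bodies."""
    parts = [f"{name} AS (\n{body.strip()}\n)" for name, body in ctes.items()]
    return "WITH " + ",\n".join(parts) + "\n" + final_select.strip()


# Placeholder CTE bodies; later CTEs may refer to earlier ones by name.
ctes = {
    "pdc": "SELECT 1 AS case_id, 'A1BG' AS gene_symbol",
    "qdata": "SELECT * FROM pdc",
}
sql = compose_query(ctes, "SELECT gene_symbol FROM qdata")
print(sql)
```

The notebook achieves the same result by concatenating strings with a comma between the two CTEs.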

The following query combines the pdc and cptac data:

In [8]:
# Named cptac_cte so that the string does not shadow the imported cptac module
cptac_cte = '''
qdata AS (
  SELECT pdc.case_submitter_id, pdc.gene_symbol, pdc.protein_abundance_log2ratio,
        cptac.protein_abundance
  FROM pdc 
  JOIN `{0}.{1}` as cptac 
    ON pdc.case_submitter_id = cptac.Patient_ID
    AND pdc.gene_symbol = cptac.gene_name
)
'''.format(my_project_id, table_id)

Finally, we compute the Pearson correlation for each gene, keeping only genes with at least 20 paired, non-missing measurements.

In [9]:
query = (pdc + ',' + cptac_cte + '''
SELECT gene_symbol, COUNT(*) AS N, CORR(protein_abundance_log2ratio, protein_abundance) AS Correlations
FROM qdata 
WHERE NOT IS_NAN(protein_abundance_log2ratio)
      AND NOT IS_NAN(protein_abundance) 
GROUP BY gene_symbol
HAVING N >= 20 
ORDER BY Correlations DESC
''' )

df1 = pandas_gbq.read_gbq(query, project_id=my_project_id)
df1
Out[9]:
gene_symbol N Correlations
0 GLYATL2 21 0.990671
1 SLC2A10 49 0.989670
2 TNC 108 0.983017
3 BCAS1 104 0.980725
4 SLC27A2 108 0.980098
... ... ... ...
9645 SACS 108 -0.166687
9646 SLC38A9 30 -0.303329
9647 SASS6 27 -0.329240
9648 SPEF2 30 -0.382587
9649 NUFIP1 23 -0.383124

9650 rows × 3 columns
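
For readers who want to check the logic without BigQuery, the sketch below reproduces the same steps in pandas on made-up data: drop missing pairs, group by gene, keep genes with at least 20 pairs, and compute the Pearson correlation. The function name `per_gene_pearson` and the toy genes are assumptions for illustration; the column names mirror those used in the query.

```python
import numpy as np
import pandas as pd


def per_gene_pearson(df, x="protein_abundance_log2ratio",
                     y="protein_abundance", min_n=20):
    """Per-gene Pearson correlation over rows where both values are present."""
    clean = df.dropna(subset=[x, y])
    rows = []
    for gene, grp in clean.groupby("gene_symbol"):
        if len(grp) >= min_n:  # same role as HAVING N >= 20
            rows.append({"gene_symbol": gene, "N": len(grp),
                         "Correlations": grp[x].corr(grp[y])})
    out = pd.DataFrame(rows, columns=["gene_symbol", "N", "Correlations"])
    return out.sort_values("Correlations", ascending=False).reset_index(drop=True)


# Made-up data: GENE_A agrees perfectly, GENE_B is inverted, GENE_C has too few rows.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
toy = pd.DataFrame({
    "gene_symbol": ["GENE_A"] * n + ["GENE_B"] * n + ["GENE_C"] * 5,
    "protein_abundance_log2ratio": np.concatenate([x, x, x[:5]]),
    "protein_abundance": np.concatenate([2 * x + 1, -x, x[:5]]),
})
toy.loc[0, "protein_abundance"] = np.nan  # one missing value is excluded from N

result = per_gene_pearson(toy)
print(result)
```

Because Pearson correlation is unchanged by shifting or positive rescaling of either variable, two pipelines that normalize abundances differently can still correlate strongly.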

Histogram of correlations

The results above give the correlation between PDC and cptac protein expression for 9650 genes. Next we show a histogram of these correlations.

In [10]:
sns.displot(data=df1, x="Correlations", binwidth=0.1)
plt.xlim(-1.0, 1.1)
Out[10]:
(-1.0, 1.1)

The histogram shows that the two pipelines used in this analysis produced similar protein expression for most genes.
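
One way to put a number on "most genes" is to report the median correlation and the fraction of genes at or above a chosen cutoff. The sketch below does this on a made-up frame with the same `Correlations` column as `df1`; the helper name `summarize_correlations` and the 0.8 cutoff are arbitrary assumptions, not values taken from this analysis.

```python
import pandas as pd


def summarize_correlations(df, column="Correlations", threshold=0.8):
    """Summarize a column of per-gene correlations (threshold is an arbitrary choice)."""
    values = df[column].dropna()
    return {
        "n_genes": int(values.size),
        "median": float(values.median()),
        "fraction_at_or_above_threshold": float((values >= threshold).mean()),
    }


# Made-up correlations; in the notebook this would be called on df1 instead.
toy = pd.DataFrame({"Correlations": [0.99, 0.95, 0.90, 0.50, -0.30]})
summary = summarize_correlations(toy)
print(summary)
```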