The PTM data from CST has a significant amount of missing data and requires special consideration when normalizing. The starting data is ratio-level-data - where log2 ratios have been calculated from the cancerous cell lines compared to the non-cancerous 'Normal Pool' data from within the 'plex'. This data is under the lung_cellline_3_1_16 directory and each PTM type has its own '_combined_ratios.tsv' file.
This notebook will overview the ratio-level datat from the PTM types: phosphorylation, methylation, and acetylation. The figures in this notebook demonstrate that there is a systematic difference in the distributions of PTM measurements in the lung cancer cell lines regardless of PTMs with missing data are considered. The normalization procedures used to correct for this systematic bias are discussed in the CST_PTM_Normalization_Overview notebook.
The systematic difference in average PTM ratios in the cell lines could be due to a number of factors:
In any case, it can be useful towards understanding the differences in cell line behavior to remove this systematic difference.
I'll start by having a look at the phosphorylation data that can be found in
lung_cellline_3_1_16/lung_cellline_phospho/lung_cellline_TMT_phospho_combined_ratios.tsv
This file was made using the process_latest_cst_data.py
script. First I'll make the necessary imports.
# imports and plotting defaults
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
from copy import deepcopy
# use clustergrammer module to load/process (source code in clustergrammer directory)
from clustergrammer import Network
Next, I'll load the phosphorylation ratio data and simplify the column names (to improve readability)
# load data data and export as pandas dataframe: inst_df
def load_data(filename):
'''
load data using clustergrammer and export as pandas dataframe
'''
net = deepcopy(Network())
net.load_file(filename)
tmp_df = net.dat_to_df()
inst_df = tmp_df['mat']
# simplify column names (remove categories)
col_names = inst_df.columns.tolist()
# simple_col_names = []
# for inst_name in col_names:
# simple_col_names.append(inst_name[0])
inst_df.columns = col_names
print(inst_df.shape)
ini_rows = inst_df.index.tolist()
unique_rows = list(set(ini_rows))
if len(ini_rows) > len(unique_rows):
print('found duplicate PTMs')
else:
print('did not find duplicate PTMs')
return inst_df
filename = '../lung_cellline_3_1_16/lung_cellline_phospho/' + \
'lung_cellline_TMT_phospho_combined_ratios.tsv'
inst_df = load_data(filename)
(5798, 45) did not find duplicate PTMs
I loaded the phosphorylation tsv file using clustergrammer and exported it as a pandas dataframe. We can see that there are 5,798 unique phosphorylation sites measured in all 45 lung cancer cell lines.
However, there is also a large amount of missing data, e.g. no cell line has all 5798 phosphorylations mesaured. We can plot the number of measured phosphorylation sites (e.g. non-NaN values in the dataframe) below to get a sense of the amount of missing data
inst_df.count().sort_values().plot(kind='bar', figsize=(10,2))
print(type(inst_df))
<class 'pandas.core.frame.DataFrame'>
In the above visualization I have ranked the cell lines based in increasing number of measurements. We can see that there is a pattern in the missing data. The 45 cell lines appear to be aranged into nine groups of 5 cell lines each. These groups correpond to the 9 'plexes', or 'batches', in which the cell lines were measured. Each plex measured one control, Normal Pool, and five cancer cell lines (note that some cell lines have been measured in more than one plex and these have their plex number appended to their name).
Since each cell line has a large number of measured phosphorylations (at least 1,500) we can reasonably expect that the distributions of phosphorylation levels in the cell lines will be similar. This is based on the assumption that biological variation is not systematic and should not result in consistently higher or lower measurements in the cell lines.
Below we plot the mean values (ratios) of all measured phosphorylations in each cell line and order the cell lines by their average phosphorylation levels in ascending order.
def plot_cl_boxplot_with_missing_data(inst_df):
'''
Make a box plot of the cell lines where the cell lines are ranked based
on their average PTM levels
'''
# get the order of the cell lines based on their mean
sorter = inst_df.mean().sort_values().index.tolist()
# reorder based on ascending mean values
sort_df = inst_df[sorter]
# box plot of PTM values ordered based on increasing mean
sort_df.plot(kind='box', figsize=(10,3), rot=90, ylim=(-8,8))
plot_cl_boxplot_with_missing_data(inst_df)
We can see that there is a significant difference in the mean phosphorylation level across the cell lines. These large differenecs in the cell line distributions lead us to believe that there is a systematic error in the measurements that needs to be corrected.
However, each cell line has a different subset of phosphorylations measured so to more fairly compare the cell lines we should only compare commonly measured phosphorylations.
Below we plot the mean values of phosphorylations that were measured in all cell lines.
def plot_cl_boxplot_no_missing_data(inst_df):
# get the order of the cell lines based on their mean
sorter = inst_df.mean().sort_values().index.tolist()
# reorder based on ascending mean values
sort_df = inst_df[sorter]
# transpose to get PTMs as columns
tmp_df = sort_df.transpose()
# keep only PTMs that are measured in all cell lines
ptm_num_meas = tmp_df.count()
ptm_all_meas = ptm_num_meas[ptm_num_meas == 45]
ptm_all_meas = ptm_all_meas.index.tolist()
print('There are ' + str(len(ptm_all_meas)) + ' PTMs measured in all cell lines')
# only keep ptms that are measured in all cell lines
# I will call this full_df as in no missing measurements
full_df = tmp_df[ptm_all_meas]
# transpose back to PTMs as rows
full_df = full_df.transpose()
full_df.plot(kind='box', figsize=(10,3), rot=90, ylim=(-8,8))
num_ptm_all_meas = len(ptm_all_meas)
plot_cl_boxplot_no_missing_data(inst_df)
There are 513 PTMs measured in all cell lines
From the above box plot we can see that there is a significant difference in the distributions of the cell lines even when we only consider phosphorylations that were measured in all cell lines (note that the cell lines are in the same order as the previous box plot). This indicates that this systematic differnce in average phosphorylation values is not caused by missing values.
Since we do not expect biological variation to cause this type of systematic difference between cell lines we can conclude that the large differences between cell lines are likely the result of systematic experimental error that should be corrected. Normalizing the data will be discussed here
I will perform the same overview on the acetylation data. There are 1,192 unique acetylations measured in the 45 cell lines.
filename = '../lung_cellline_3_1_16/lung_cellline_Ack/' + \
'lung_cellline_TMT_Ack_combined_ratios.tsv'
inst_df = load_data(filename)
(1192, 45) did not find duplicate PTMs
inst_df.count().sort_values().plot(kind='bar', figsize=(10,2))
<matplotlib.axes._subplots.AxesSubplot at 0x11228b490>
plot_cl_boxplot_with_missing_data(inst_df)
Distribution of Acetylation data that was measured in all cell lines
plot_cl_boxplot_no_missing_data(inst_df)
There are 125 PTMs measured in all cell lines
filename = '../lung_cellline_3_1_16/lung_cellline_Rme1/' + \
'lung_cellline_TMT_Rme1_combined_ratios.tsv'
inst_df = load_data(filename)
(1248, 45) did not find duplicate PTMs
inst_df.count().sort_values().plot(kind='bar', figsize=(10,2))
<matplotlib.axes._subplots.AxesSubplot at 0x112181690>
plot_cl_boxplot_with_missing_data(inst_df)
Argining Methylation that was measured in all cell lines
plot_cl_boxplot_no_missing_data(inst_df)
There are 193 PTMs measured in all cell lines
There are 230 lysine methylations measured in all cell line
filename = '../lung_cellline_3_1_16/lung_cellline_Kme1/' + \
'lung_cellline_TMT_Kme1_combined_ratios.tsv'
inst_df = load_data(filename)
(230, 45) did not find duplicate PTMs
Some cell lines have as few as 40 lysine methylations measured.
inst_df.count().sort_values().plot(kind='bar', figsize=(10,2))
<matplotlib.axes._subplots.AxesSubplot at 0x10dcfe210>
plot_cl_boxplot_with_missing_data(inst_df)
Lysine methylation that was measured in all cell lines
plot_cl_boxplot_no_missing_data(inst_df)
There are 26 PTMs measured in all cell lines
There were only 26 lysine methylations that were measured in all cell lines. We still see the bias in the average values across the cell lines.
We see that the PTM measurements (phosphorylation, acetylation, and methylation) all show large differences in average behavior across the cell lines. Furthermore, the cell lines with the highest and lowest ratios are frequently the same: DMS153 is hte cell line with the lowest ratios and H661 is the cell line with the highest ratios in all cases. '
In other words, if we were to ask which cell line has the highest or lowest level of a particular PTM site we would almost always get the same cell line no matter which site we were interested in. Since this type of uniform and systematic difference between cell lines is not what we expect biologically we can conclude that the ratio data should be normalized in some way. The normalization procedure and its affects on cell line clustering are discussed in the notebook CST_PTM_Normalization_Overview notebook.