Welcome to the CytoMod example Jupyter Notebook!
The full paper describing the method can be found at
https://www.frontiersin.org/articles/10.3389/fimmu.2019.01338/
This notebook runs over an example dataset from the FLU09 study.
In order to run the notebook yourself with your own data, download the CytoMod folder from https://github.com/liel-cohen/CytoMod. The folder contains a folder data_files/data that contains files named cytokine_data.xlsx and patient_data.xlsx, which hold the data for the notebook analysis. In the folder you have downloaded, you can replace these files with your own data files while following the format instructions bellow.
(named PTID in the example file) which will be converted
to row indexes. If your dataset has no subject IDs,
change 'indexCol=0' to 'indexCol=None', in both cy_data and patients_data
initialization (under the Import data header).
The next columns are your cytokines data. Each column should have
the raw cytokine measurment for the subject indicated in the specific row.
Make sure binary columns contain 0 and 1 values, or True and False values
(and cells with unknown values are left empty)
A folder named 'output' will be created by the code inside the data_files folder. The code writes all results and figures into this folder.
See the Define arguments section for further instructions for this analysis.
The code was written using the Anaconda3 Python interpreter and packages.
Recommended versions: Python 3.7.1, Pandas 0.23.4, Numpy 1.16.2
The palettable module (https://pypi.org/project/palettable/) must also be installed.
import os
import sys
import pandas as pd
sys.path.append(os.path.join(os.getcwd(), 'cytomod', 'otherTools'))
import matplotlib.pyplot as plt
import cytomod
import cytomod.run_gap_statistic as gap_stat
import cytomod.assoc_to_outcome as outcome
from cytomod import plotting as cyplot
from hclusterplot import plotHColCluster
import tools
import numpy as np
import random
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
args = tools.Object()
args.name_data = 'FLU09'
args.name_compartment = 'Plasma'
args.log_transform = True
args.max_testing_k = 8
args.max_final_k = 6 # Must be <= max_testing_k
args.recalculate_modules = True
args.outcomes = ['FluPositive'] # names of outcome columns
args.logistic = True # True if outcomes are binary. False if outcomes are continuous.
args.covariates = ['Age'] # names of regression covariates to control for
args.log_column_names = ['Age'] # or empty list: []
args.cytokines = None # if none, will analyze all
args.seed = 1234
args.path_files = os.path.join(os.getcwd(), 'data_files')
args.paths = {'files': os.path.join(os.getcwd(), 'data_files'),
'data': os.path.join(os.getcwd(), 'data_files', 'data'),
'gap_statistic': os.path.join(os.getcwd(), 'data_files', 'output', 'gap_statistic'),
'clustering': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering'),
'clustering_info': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering', 'info'),
'clustering_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'clustering', 'figures'),
'correlation_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'correlations'),
'association_figures': os.path.join(os.getcwd(), 'data_files', 'output', 'associations'),
}
tools.create_folder(args.paths['gap_statistic'])
tools.create_folder(args.paths['clustering_info'])
tools.create_folder(args.paths['clustering_figures'])
tools.create_folder(args.paths['correlation_figures'])
tools.create_folder(args.paths['association_figures'])
print('Data files are read from folder:', os.path.join(os.getcwd(), 'data_files', 'data'),'\n')
print('Output will be saved to folder:', os.path.join(os.getcwd(), 'data_files', 'output'))
Data files are read from folder: C:\Users\liel-\Dropbox\PyCharm\PycharmProjectsNew\CytoMod_git\data_files\data Output will be saved to folder: C:\Users\liel-\Dropbox\PyCharm\PycharmProjectsNew\CytoMod_git\data_files\output
assert type(args.name_data) is str
assert type(args.name_compartment) is str
assert type(args.log_transform) is bool
assert type(args.logistic) is bool
assert type(args.max_testing_k) is int
assert type(args.max_final_k) is int
assert args.max_final_k <= args.max_testing_k
assert type(args.outcomes) is list
assert type(args.covariates) is list
for col_name in args.outcomes + args.covariates + args.log_column_names:
assert type(col_name) is str
tools.assert_column_exists_in_path(os.path.join(args.paths['data'], 'patient_data.xlsx'), col_name)
# If you got here -
print("Args are valid!")
Args are valid!
cy_data = tools.read_excel(os.path.join(args.paths['data'], 'cytokine_data.xlsx'), indexCol=0)
cy_data.dropna(axis='index', how='all', inplace=True)
if args.cytokines is None:
args.cytokines = list(cy_data.columns)
# Only cytokines contained in args.cytokines list
cy_data = cy_data[args.cytokines]
# See first 5 rows of the cytokines dataframe
cy_data.head()
EGF | Eotaxin | FGF-2 | FLT3L | FKN | GCSF | GM-CSF | GRO | IFNα2 | IFNγ | ... | MCP1 | MCP3 | MDC | MIP1α | MIP1β | sCD40-L | TGFα | TNFα | TNFβ | VEGF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID | |||||||||||||||||||||
3200 | 5.24 | 24.34 | 39.02 | 1.94 | 20.86 | 39.52 | 6.60 | 98.41 | 8.12 | 9.97 | ... | 199.33 | 9.66 | 775.28 | 3.01 | 18.40 | 530.95 | 1.39 | 15.25 | 2.09 | 113.09 |
3202 | 179.16 | 39.41 | 28.62 | 2.50 | 49.11 | 20.92 | 356.27 | 1019.90 | 27.21 | 8.36 | ... | 2737.32 | 67.07 | 1158.00 | 171.51 | 150.73 | 10859.95 | 3.61 | 15.93 | 1.61 | 77.21 |
3204 | 191.72 | 42.49 | 7.68 | 0.67 | 33.36 | 10.44 | 22.71 | 2038.28 | 25.98 | 7.36 | ... | 159.52 | 6.66 | 1867.52 | 8.43 | 35.67 | 11849.41 | 2.36 | 4.91 | 0.53 | 45.34 |
3206 | 132.00 | 93.76 | 33.89 | 0.47 | 128.00 | 67.87 | 147.00 | 4132.00 | 27.43 | 13.54 | ... | 301.00 | 23.69 | 1139.00 | 208.00 | 110.00 | 13420.00 | 24.26 | 14.16 | 0.59 | 32.60 |
3209 | 12.91 | 18.75 | 17.37 | 2.50 | 2.62 | 12.83 | 231.27 | 149.67 | 6.48 | 63.64 | ... | 98.76 | 7.09 | 965.21 | 49.08 | 1.51 | 250.07 | 1.90 | 0.68 | 1.61 | 400.35 |
5 rows × 37 columns
if args.outcomes != []:
patient_data = tools.read_excel(os.path.join(args.paths['data'], 'patient_data.xlsx'),
indexCol=0)
patient_data.dropna(axis='index', how='all', inplace=True)
# See first 5 rows of the patients dataframe
patient_data.head()
# Check if args.logistic flag is correct for all outcomes defined in args.outcomes
for outcome_col in args.outcomes:
is_logistic = np.isin(patient_data[outcome_col].unique(), [0, 1]).all() # checks if the data in outcomes column is binary (0,1 or true,false)
if args.logistic != is_logistic: # mismatch! check which case
if args.logistic:
raise Exception('args.logistic defined as True. '
'However, outcome variable ' + outcome_col + ' seems '
'to be continuous and not binary. Please check and fix!')
else:
raise Exception('args.logistic defined as False. '
'However, outcome variable ' + outcome_col + ' seems '
'to be binary and not continuous. Please check and fix!')
# log transform cytokines and args.log_column_names
if args.log_transform:
cy_data = np.log10(cy_data)
if args.log_column_names != [] and args.outcomes != []:
for col_name in args.log_column_names:
new_col_name = 'log_' + col_name
# log transform variable
patient_data[new_col_name] = np.log10(patient_data[col_name])
# replace column with new log transformed column
if col_name in args.covariates:
args.covariates.remove(col_name)
args.covariates.append(new_col_name)
do_recalculate = args.recalculate_modules or \
not os.path.exists(os.path.join(args.paths['clustering'], 'cyto_mod_adj.dill'))
# If modules file does not exist in storage,
# or args.recalculate_modules=True - prepare modules.
# Otherwise - read from file.
if do_recalculate:
cyto_mod_abs = cytomod.cytomod_class(args.name_data, args.name_compartment, False, cy_data)
cyto_mod_adj = cytomod.cytomod_class(args.name_data, args.name_compartment, True, cy_data)
else:
cyto_mod_abs = tools.read_from_dill(os.path.join(args.paths['clustering'], 'cyto_mod_abs.dill'))
cyto_mod_adj = tools.read_from_dill(os.path.join(args.paths['clustering'], 'cyto_mod_adj.dill'))
cols = plotHColCluster(cyto_mod_abs.cyDf, method='complete',
metric='pearson-signed', figsize=(10,6),
save_path=os.path.join(args.paths['correlation_figures'],
'%s_correlation_heatmap.png' % cyto_mod_abs.name))
cyplot.plotMeanCorr(cyto_mod_abs.withMean, cyto_mod_abs.meanS.name,
cyList=sorted(cyto_mod_abs.cyDf.columns),
save_path=os.path.join(args.paths['correlation_figures'],
'%s_cy_mean_correlation.png' % cyto_mod_abs.name))
cols = plotHColCluster(cyto_mod_adj.cyDf, method='complete',
metric='pearson-signed', figsize=(10, 6),
save_path=os.path.join(args.paths['correlation_figures'],
'%s_correlation_heatmap.png' % cyto_mod_adj.name))
# If modules file does not exist in storage,
# or args.recalculate_modules=True - compute best K.
# Otherwise - read from file.
if do_recalculate:
random.seed(args.seed) # set seed for random numbers stream
bestK = {}
bestK['adj'] = gap_stat.getBestK(cyto_mod_adj.cyDf,
max_testing_k = args.max_testing_k,
max_final_k=args.max_final_k,
save_fig_path=os.path.join(args.paths['gap_statistic'], 'gap_stat_adj.png'))
########## Checking K=1 ########## Checking K=2 ########## Checking K=3 ########## Checking K=4 ########## Checking K=5 ########## Checking K=6 ########## Checking K=7 ########## Checking K=8 ########## Checking K=9
<Figure size 432x288 with 0 Axes>
if do_recalculate:
bestK['abs'] = gap_stat.getBestK(cyto_mod_abs.cyDf,
max_testing_k=args.max_testing_k,
max_final_k=args.max_final_k,
save_fig_path=os.path.join(args.paths['gap_statistic'], 'gap_stat_abs.png'))
tools.write_DF_to_excel(os.path.join(args.paths['clustering'], 'bestK.xlsx'), bestK)
else:
# Get modules from storage
bestK = tools.read_excel(os.path.join(args.paths['clustering'], 'bestK.xlsx'))
bestK = dict(bestK['value'])
########## Checking K=1 ########## Checking K=2 ########## Checking K=3 ########## Checking K=4 ########## Checking K=5 ########## Checking K=6 ########## Checking K=7 ########## Checking K=8 ########## Checking K=9
<Figure size 432x288 with 0 Axes>
print(bestK['abs'])
5
print(bestK['adj'])
4
# Cluster and write modules to file
if do_recalculate:
cyto_mod_adj.cluster_cytokines(K=bestK['adj'])
cyto_mod_abs.cluster_cytokines(K=bestK['abs'])
tools.write_to_dill(os.path.join(args.paths['clustering'], 'cyto_mod_adj.dill'), cyto_mod_adj)
tools.write_to_dill(os.path.join(args.paths['clustering'], 'cyto_mod_abs.dill'), cyto_mod_abs)
else: # Read modules from file
cyto_mod_adj = tools.read_from_dill(os.path.join(args.paths['clustering'], 'cyto_mod_adj.dill'))
cyto_mod_abs = tools.read_from_dill(os.path.join(args.paths['clustering'], 'cyto_mod_abs.dill'))
cyto_modules = {'adj': cyto_mod_adj, 'abs': cyto_mod_abs}
cytomod.io.plot_clustering_heatmap(cyto_modules['abs'], args.paths['clustering_figures'],
figsize=(10, 6))
<Figure size 720x432 with 0 Axes>
cytomod.io.plot_color_legend(cyto_modules['abs'], args.paths['clustering_figures'])
cytomod.io.plot_reliability(cyto_modules['abs'], args.paths['clustering_figures'],
figsize=(10, 6))
cytomod.io.plot_color_legend(cyto_modules['abs'], args.paths['clustering_figures'])
cytomod.io.plot_module_correl(cyto_modules['abs'], args.paths['clustering_figures'])
cytomod.io.plot_clustering_heatmap(cyto_modules['adj'], args.paths['clustering_figures'],
figsize=(10,6))
<Figure size 720x432 with 0 Axes>
cytomod.io.plot_color_legend(cyto_modules['adj'], args.paths['clustering_figures'])
cytomod.io.plot_reliability(cyto_modules['adj'], args.paths['clustering_figures'],
figsize=(10,6))
cytomod.io.plot_color_legend(cyto_modules['adj'], args.paths['clustering_figures'])
cytomod.io.plot_module_correl(cyto_modules['adj'], args.paths['clustering_figures'])
cytomod.io.write_modules(cyto_modules['abs'], args.paths['clustering_info'])
cytomod.io.write_modules(cyto_modules['adj'], args.paths['clustering_info'])
# standardize numeric covariates
# *Deleted* - code moved to outcomeAnalysis function.
if args.outcomes != []:
mod_outcome_abs_df = outcome.outcomeAnalysis(cyto_modules['abs'], patient_data,
analyzeModules=True,
outcomeVars=args.outcomes,
adjustmentVars=args.covariates,
logistic=args.logistic,
standardize=True)
cy_outcome_abs_df = outcome.outcomeAnalysis(cyto_modules['abs'], patient_data,
analyzeModules=False,
outcomeVars=args.outcomes,
adjustmentVars=args.covariates,
logistic=args.logistic,
standardize=True)
if args.outcomes != []:
mod_outcome_adj_df = outcome.outcomeAnalysis(cyto_modules['adj'], patient_data,
analyzeModules=True,
outcomeVars=args.outcomes,
adjustmentVars=args.covariates,
logistic=args.logistic,
standardize=True)
cy_outcome_adj_df = outcome.outcomeAnalysis(cyto_modules['adj'], patient_data,
analyzeModules=False,
outcomeVars=args.outcomes,
adjustmentVars=args.covariates,
logistic=args.logistic,
standardize=True)
# If args.logistic = False, these variables must be symmetric around 0.
# For example:
# colorscale_values = [-0.8, -0.4, 0, 0.4, 0.8]
# colorscale_labels = [-0.8, -0.4, 0, 0.4, 0.8]
# If args.logistic = True, these variables must be symmetric around 1.
# For example
# colorscale_values = [1 / 2.5, 1 / 2, 1 / 1.5, 1, 1.5, 2, 2.5]
# colorscale_labels = ['1/2.5', '1/2', '1/1.5', 1, 1.5, 2, 2.5]
colorscale_values = [1 / 2.5, 1 / 2, 1 / 1.5, 1, 1.5, 2, 2.5]
colorscale_labels = ['1/2.5', '1/2', '1/1.5', 1, 1.5, 2, 2.5]
# Does colorscale_values range include all regression coefficients issued by outcomeAnalysis?
outcome.check_colorscale_range(colorscale_values, args.logistic,
cy_outcome_abs_df, mod_outcome_abs_df, cy_outcome_adj_df, mod_outcome_adj_df)
if args.outcomes != []:
outcome.plotResultSummary(cyto_modules['abs'],
mod_outcome_abs_df,
cy_outcome_abs_df,
args.outcomes,
logistic=args.logistic,
fdr_thresh_plot=0.2,
compartmentName=args.name_compartment,
figsize=(6,9),
scale_values=colorscale_values,
scale_labels=colorscale_labels,
save_fig_path=os.path.join(args.paths['association_figures'], 'associations_abs.png'))
if args.outcomes != []:
outcome.plotResultSummary(cyto_modules['adj'],
mod_outcome_adj_df,
cy_outcome_adj_df,
args.outcomes,
logistic=args.logistic,
fdr_thresh_plot=0.2,
compartmentName=args.name_compartment,
figsize=(6,9),
scale_values=colorscale_values,
scale_labels=colorscale_labels,
save_fig_path=os.path.join(args.paths['association_figures'], 'associations_adj.png'))
Output table file type is currently set to pdf. You can change the file type to csv by changing the output file name. For example, in the following code cell, change 'associations_abs_mod_pvals.pdf' to 'associations_abs_mod_pvals.csv'.
if args.outcomes != []:
# cytokines
outcome.printTable(mod_outcome_abs_df,
title=args.name_compartment + ' (Absolute)',
fdr_output_limit=1, fwer_output_limit=1, pval_output_limit=1,
output_file_path=os.path.join(args.paths['association_figures'],
'associations_abs_mod_pvals.pdf'),
print_to_console=True)
# modules
outcome.printTable(cy_outcome_abs_df,
title=args.name_compartment + ' (Absolute)',
fdr_output_limit=1, fwer_output_limit=1, pval_output_limit=1,
output_file_path=os.path.join(args.paths['association_figures'],
'associations_abs_cy_pvals.pdf'),
print_to_console=True)
Plasma (Absolute) ================================== Outcome Module OR pvalue FWER FDR 1 FluPositive Plasma2 1.4 0.033 0.17 0.13 0 FluPositive Plasma1 1.3 0.053 0.21 0.13 4 FluPositive Plasma5 1.3 0.14 0.41 0.23 3 FluPositive Plasma4 0.95 0.75 1 0.82 2 FluPositive Plasma3 1 0.82 1 0.82 Plasma (Absolute) ================================== Outcome Analyte OR pvalue FWER FDR 5 FluPositive GCSF 2.1 5.3e-05 0.002 0.002 26 FluPositive IP10 1.9 0.00034 0.012 0.0063 21 FluPositive IL10 1.8 0.00056 0.019 0.0069 34 FluPositive TNFα 1.6 0.0048 0.16 0.038 8 FluPositive IFNα2 1.6 0.0056 0.18 0.038 10 FluPositive IL1α 1.6 0.0061 0.19 0.038 22 FluPositive IL12-P40 1.5 0.009 0.28 0.048 4 FluPositive FKN 1.5 0.017 0.51 0.079 2 FluPositive FGF-2 1.4 0.029 0.85 0.12 15 FluPositive IL4 0.72 0.043 1 0.16 25 FluPositive IL15 1.4 0.056 1 0.19 17 FluPositive IL6 1.3 0.08 1 0.25 32 FluPositive sCD40-L 0.76 0.088 1 0.25 31 FluPositive MIP1β 1.3 0.12 1 0.31 12 FluPositive IL1Ra/Ra2 1.3 0.13 1 0.31 29 FluPositive MDC 0.73 0.15 1 0.35 9 FluPositive IFNγ 1.2 0.2 1 0.41 28 FluPositive MCP3 1.2 0.2 1 0.41 6 FluPositive GM-CSF 1.2 0.22 1 0.41 23 FluPositive IL12-P70 1.2 0.23 1 0.41 13 FluPositive IL2 1.2 0.24 1 0.41 19 FluPositive IL8 0.86 0.32 1 0.54 14 FluPositive IL3 1.2 0.36 1 0.57 33 FluPositive TGFα 1.1 0.37 1 0.57 24 FluPositive IL13 0.87 0.38 1 0.57 36 FluPositive VEGF 1.1 0.42 1 0.6 3 FluPositive FLT3L 1.1 0.48 1 0.66 30 FluPositive MIP1α 1.1 0.54 1 0.72 20 FluPositive IL9 0.91 0.57 1 0.72 18 FluPositive IL7 1.1 0.65 1 0.79 27 FluPositive MCP1 1.1 0.68 1 0.79 11 FluPositive IL1β 1.1 0.68 1 0.79 7 FluPositive GRO 1 0.91 1 0.98 16 FluPositive IL5 0.98 0.91 1 0.98 35 FluPositive TNFβ 1 0.94 1 0.98 1 FluPositive Eotaxin 1 0.95 1 0.98 0 FluPositive EGF 1 0.99 1 0.99
if args.outcomes != []:
# modules
outcome.printTable(mod_outcome_adj_df,
title=args.name_compartment + ' (Adjusted)',
fdr_output_limit=1, fwer_output_limit=1, pval_output_limit=1,
output_file_path=os.path.join(args.paths['association_figures'],
'associations_adj_mod_pvals.pdf'),
print_to_console=True)
# cytokines
outcome.printTable(cy_outcome_adj_df,
title=args.name_compartment + ' (Adjusted)',
fdr_output_limit=1, fwer_output_limit=1, pval_output_limit=1,
output_file_path=os.path.join(args.paths['association_figures'],
'associations_adj_cy_pvals.pdf'),
print_to_console=True)
Plasma (Adjusted) ================================== Outcome Module OR pvalue FWER FDR 2 FluPositive Plasma3 1.7 0.0011 0.0043 0.0043 1 FluPositive Plasma2 1.3 0.095 0.28 0.19 3 FluPositive Plasma4 0.84 0.27 0.53 0.36 0 FluPositive Plasma1 0.95 0.74 0.74 0.74 Plasma (Adjusted) ================================== Outcome Analyte OR pvalue FWER FDR 5 FluPositive GCSF 2 0.0002 0.0075 0.0075 15 FluPositive IL4 0.57 0.00056 0.02 0.01 26 FluPositive IP10 1.7 0.0014 0.05 0.018 21 FluPositive IL10 1.7 0.0034 0.12 0.032 19 FluPositive IL8 0.67 0.013 0.42 0.085 34 FluPositive TNFα 1.5 0.014 0.44 0.085 8 FluPositive IFNα2 1.5 0.023 0.7 0.12 32 FluPositive sCD40-L 0.7 0.031 0.94 0.15 10 FluPositive IL1α 1.4 0.042 1 0.17 22 FluPositive IL12-P40 1.4 0.052 1 0.19 29 FluPositive MDC 0.66 0.058 1 0.19 20 FluPositive IL9 0.77 0.086 1 0.26 24 FluPositive IL13 0.76 0.097 1 0.26 4 FluPositive FKN 1.3 0.099 1 0.26 16 FluPositive IL5 0.78 0.12 1 0.29 2 FluPositive FGF-2 1.2 0.15 1 0.34 0 FluPositive EGF 0.8 0.15 1 0.34 35 FluPositive TNFβ 0.81 0.19 1 0.4 25 FluPositive IL15 1.2 0.3 1 0.58 7 FluPositive GRO 0.86 0.34 1 0.63 17 FluPositive IL6 1.1 0.42 1 0.73 1 FluPositive Eotaxin 0.9 0.48 1 0.81 18 FluPositive IL7 0.91 0.54 1 0.82 31 FluPositive MIP1β 1.1 0.57 1 0.82 33 FluPositive TGFα 0.92 0.57 1 0.82 11 FluPositive IL1β 0.92 0.58 1 0.82 30 FluPositive MIP1α 0.93 0.66 1 0.9 12 FluPositive IL1Ra/Ra2 1.1 0.68 1 0.9 36 FluPositive VEGF 0.95 0.72 1 0.91 9 FluPositive IFNγ 1 0.8 1 0.92 27 FluPositive MCP1 0.96 0.81 1 0.92 3 FluPositive FLT3L 0.97 0.84 1 0.92 28 FluPositive MCP3 0.97 0.86 1 0.92 6 FluPositive GM-CSF 1 0.89 1 0.92 13 FluPositive IL2 0.98 0.89 1 0.92 14 FluPositive IL3 1 0.92 1 0.92 23 FluPositive IL12-P70 1 0.92 1 0.92