You can follow this Jupyter Notebook to reproduce Figure 3 of our paper on reproducibility in bioinformatics (Kim et al., BioRxiv 2017).
Our StratiPy project is based on the NBS (Network Based Stratification) method (Hofree et al., Nat. Meth. 2013).
We recommend running this Python 3 project in a virtual environment. For Anaconda users, you can create a virtual environment with:
~$ conda create -n <your_environment_name> python=3.6 anaconda
Then you can activate your virtual environment with:
~$ source activate <your_environment_name>
When you want to deactivate your virtual environment:
~$ source deactivate
More details are available in Conda's documentation on environments.
NBS has several tuning parameters. To reproduce the same results, use the same values as outlined in the original NBS study, except for two adjustable parameters.
Details about each parameter are given in the docstrings of reproducibility.py.
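Two of these parameters, `alpha` and `tol`, govern the network diffusion step. In NBS-style propagation, the patients-by-genes mutation profile F is iterated as F ← α·F·A′ + (1 − α)·F₀ over a degree-normalized adjacency matrix A′ until the update falls below `tol`. A minimal sketch of this idea (the toy network and the `propagate` helper are illustrative assumptions, not StratiPy's actual implementation):

```python
import numpy as np

def propagate(F0, A, alpha=0.7, tol=1e-6, max_iter=1000):
    """Iterate F <- alpha * F @ A_norm + (1 - alpha) * F0 until the
    change between iterations drops below `tol`. Illustrative sketch only."""
    A_norm = A / A.sum(axis=0, keepdims=True)  # column-normalized adjacency
    F = F0.copy()
    for _ in range(max_iter):
        F_next = alpha * F @ A_norm + (1 - alpha) * F0
        if np.linalg.norm(F_next - F) < tol:
            break
        F = F_next
    return F_next

# Toy example: one patient with a mutation on gene 0, in a 3-gene network
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
F0 = np.array([[1., 0., 0.]])
F = propagate(F0, A)
print(F)  # the mutation signal is smoothed over the network neighborhood
```

With α < 1 and a column-normalized adjacency matrix, the iteration converges to a fixed point that blends the original profile with its network neighborhood; higher `alpha` spreads the signal further.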
# Default settings
data_folder = '../data/'
patient_data = 'TCGA_UCEC' # Uterine endometrial carcinoma (uterine cancer) with 248 patients' somatic mutation data
ppi_data = 'STRING' # STRING PPI network database
influence_weight = 'min' # Influence weight of propagation on the network
simplification = True # Simplification after propagation
compute = True # Compute (rather than load) intermediate results
overwrite = False # Do not overwrite existing output files
alpha = 0.7 # Diffusion (propagation) factor
tol = 10e-3 # Convergence threshold during diffusion
ngh_max = 11 # Number of best influencers in PPI network
keep_singletons = False # Discard singleton genes (no neighbors in the PPI network)
min_mutation = 10 # Patient filtering: minimum number of mutations
max_mutation = 200000 # Patient filtering: maximum number of mutations
qn = 'mean' # Quantile normalization (QN) after diffusion is based on the mean of ranked values
n_components = 3 # Desired number of subgroups (clusters)
run_bootstrap = True # Run bootstrap permutations
run_consensus = True # Build the consensus matrix from the bootstrap runs
tol_nmf = 1e-3 # Convergence threshold of NMF and GNMF algorithm
linkage_method = 'average' # Linkage method of hierarchical clustering
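The `qn = 'mean'` setting refers to mean-based quantile normalization: every value is replaced by the mean of the values sharing its within-column rank, so all columns end up with the same distribution. A minimal, self-contained sketch with toy data (not StratiPy's implementation, which operates on the diffused mutation profiles):

```python
import numpy as np

def quantile_normalize_mean(X):
    """Mean-based quantile normalization: each value is replaced by the
    mean of all values sharing its within-column rank.
    Illustrative sketch, not StratiPy's implementation."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # per-column ranks
    rank_means = np.sort(X, axis=0).mean(axis=1)       # mean value per rank
    return rank_means[ranks]

X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
Y = quantile_normalize_mean(X)
print(Y)  # every column now contains the same set of values, re-ranked
```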
Import code
import sys
import os
sys.path.append(os.path.dirname(os.path.abspath('.')))
from stratipy import load_data, formatting_data, filtering_diffusion, clustering, hierarchical_clustering
from nbs_functions import all_functions
import scipy.sparse as sp
from scipy.io import loadmat, savemat
import numpy as np
import time
import datetime
from IPython.display import Image, display
from importlib import reload
lambd = 1 # Graph regularization factor of GNMF (the reported tuning parameter value)
n_permutations = 100 # Number of bootstrap permutations
lambd1_perm100 = all_functions(data_folder, patient_data, ppi_data, influence_weight,
simplification, compute, overwrite, alpha, tol, ngh_max,
keep_singletons, min_mutation, max_mutation, qn, n_components,
n_permutations, run_bootstrap, run_consensus, lambd, tol_nmf,
linkage_method)
lambd = 1800 # The actually used tuning parameter value
lambd1800_perm100 = all_functions(data_folder, patient_data, ppi_data, influence_weight,
simplification, compute, overwrite, alpha, tol, ngh_max,
keep_singletons, min_mutation, max_mutation, qn,
n_components, n_permutations, run_bootstrap, run_consensus,
lambd, tol_nmf, linkage_method)
Because the bootstrap step is time-consuming, we launched StratiPy with 100 permutations instead of the 1000 proposed in the original NBS work. However, you can always run StratiPy with 1000 permutations by setting "n_permutations = 1000".
In [reproducibility data](reproducibility_data/), you can find the results of the original work with 100 and 1000 permutations. Before comparing them with StratiPy's (Python) results, we first verify whether there is any significant difference between 100 and 1000 permutations.
from confusion_matrices import get_cluster_idx, repro_confusion_matrix
%matplotlib inline
result_folder_repro = "reproducibility_output/"
# 100 and 1000 permutations of bootstrap
nbs_100 = get_cluster_idx(result_folder_repro, method='nbs',
n_permutations=100)
nbs_1000 = get_cluster_idx(result_folder_repro, method='nbs',
n_permutations=1000, replace_1by2=True)
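The `replace_1by2` flag (and similar options below) relabels clusters before comparison, because the cluster indices produced by NMF-based clustering are arbitrary: the same partition of patients can come back with permuted labels. In general, two clusterings can be aligned automatically with the Hungarian algorithm on their contingency table; a sketch (the `align_labels` helper is hypothetical, not part of StratiPy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(ref, other):
    """Relabel `other` so its cluster indices best match `ref`, by
    maximizing agreement on the contingency table (Hungarian algorithm)."""
    k = max(ref.max(), other.max()) + 1
    cont = np.zeros((k, k), dtype=int)
    for r, o in zip(ref, other):
        cont[r, o] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(-cont)  # negate to maximize agreement
    mapping = {c: r for r, c in zip(row, col)}
    return np.array([mapping[o] for o in other])

ref   = np.array([0, 0, 1, 1, 2, 2])
other = np.array([1, 1, 2, 2, 0, 0])  # same partition, permuted labels
print(align_labels(ref, other))       # → [0 0 1 1 2 2]
```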
# Plot and save
repro_confusion_matrix(result_folder_repro, data1=nbs_100, data2=nbs_1000,
plot_title='Confusion matrix\n100 vs 1000 permutations of Bootstrap',
lambd=1800, tilt=True)
Confusion matrix: each row and each column corresponds to a subgroup of patients (here, three subgroups). The diagonal elements show the frequency of correct classifications for each subgroup: a high diagonal value indicates good agreement.
As you can see, there is no significant difference between the two results (100 vs. 1000 permutations). From now on, we focus only on 100 permutations of the bootstrap.
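From such a matrix, per-subgroup and overall agreement are simply the normalized diagonal. A sketch with a hypothetical 3×3 confusion matrix (the counts are invented for illustration; only the cohort size of 248 patients comes from the data above):

```python
import numpy as np

# Hypothetical confusion matrix for three subgroups; rows sum to the
# 248 patients of the TCGA_UCEC cohort (counts invented for illustration)
cm = np.array([[ 45,   3,   2],
               [  4,  60,   6],
               [  1,   2, 125]])

per_subgroup = np.diag(cm) / cm.sum(axis=1)  # fraction correct per subgroup
overall = np.trace(cm) / cm.sum()            # overall agreement
print(per_subgroup, round(overall, 3))
```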
# lambda 1 and 1800, both 100 permutations of bootstrap
stp_100_lamb1 = get_cluster_idx(result_folder_repro, method='stratipy',
n_permutations=100, replace_1by2=True, lambd=1)
stp_100_lamb1800 = get_cluster_idx(result_folder_repro, method='stratipy',
n_permutations=100, replace_1by2_2by3_3by1=True, lambd=1800)
# between NBS and StratiPy (lambda = 1)
repro_confusion_matrix(result_folder_repro, data1=nbs_100, data2=stp_100_lamb1,
plot_title='Confusion matrix with reported\ntuning parameter value',
lambd=1)
# between NBS and StratiPy (lambda = 1800)
repro_confusion_matrix(result_folder_repro, data1=nbs_100, data2=stp_100_lamb1800,
plot_title='Confusion matrix with actually used\ntuning parameter value',
lambd=1800)