We simulate paired scDNA and scRNA data following the procedure illustrated in the supplement (Figure S1). The principle is to generate the scRNA and scDNA data coherently from the same ground-truth genetic copy number and clonality, while allowing sequencing-platform-specific noise to be added.
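import os
import sys
import pandas as pd

# Simulation.py (module) is stored in '~/CCNMF/SimulationCode/'.
# sys.path entries are not tilde-expanded, so expand '~' explicitly.
sys.path.append(os.path.expanduser('~/CCNMF/SimulationCode/'))
import Simulation as st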
Specifically, we estimated the transition probability matrix as follows: we downloaded the TCGA genetic copy number difference (GCN) data from cBioPortal, covering 171 triple-negative breast cancer basal samples with paired bulk RNA-seq and DNA-seq data. In the ProbMatrix below, both the rows and the columns correspond to copy numbers 1 to 5.
ProbMatrix = [[0.42, 0.5, 0.08, 0, 0],
              [0.02, 0.52, 0.46, 0, 0],
              [0, 0, 0.5, 0.5, 0],
              [0, 0, 0.01, 0.4, 0.59],
              [0, 0, 0, 0.01, 0.99]]
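Since each row of a transition probability matrix is a distribution over copy numbers 1 to 5, a quick sanity check is that every row sums to 1. The sketch below runs that check and draws copy numbers from one row. It assumes that rows index the source copy number and columns the target copy number, and `sample_copy_number` is a hypothetical helper for illustration, not a function from Simulation.py.

```python
import math
import random

ProbMatrix = [[0.42, 0.5, 0.08, 0, 0],
              [0.02, 0.52, 0.46, 0, 0],
              [0, 0, 0.5, 0.5, 0],
              [0, 0, 0.01, 0.4, 0.59],
              [0, 0, 0, 0.01, 0.99]]

# Every row should be a probability distribution over copy numbers 1..5.
row_sums = [sum(row) for row in ProbMatrix]
assert all(math.isclose(s, 1.0) for s in row_sums)


def sample_copy_number(source_cn, matrix, rng):
    """Hypothetical helper: draw a copy number given a 1-based source copy number.

    Assumes row i of `matrix` holds the probabilities for source copy number i + 1.
    """
    copy_numbers = list(range(1, len(matrix) + 1))
    return rng.choices(copy_numbers, weights=matrix[source_cn - 1], k=1)[0]


rng = random.Random(0)
draws = [sample_copy_number(3, ProbMatrix, rng) for _ in range(1000)]
# Row 3 only has probability mass on copy numbers 3 and 4.
print(sorted(set(draws)))
```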
The configurations for the simulated data are set below. Each parameter is described in its inline comment.
Paramaters = {'Ncluster' : [3],  # Number of clusters: 2 or 3
              'Topology' : ['linear'],  # Clonal structure of the simulated data: 'linear' or 'bifurcate'
              'C1Percent' : [0.5, 0.5],  # Percentage of each cluster when the data has 2 clusters
              'C2Percent' : [0.2, 0.4, 0.4],  # Percentage of each cluster when the data has 3 clusters
              'Percentage' : [0.1, 0.2, 0.3, 0.4, 0.5],  # Copy number fractions simulated in each cluster, one case per value
              'Outlier' : [0.5],  # Outlier percentages simulated in each cluster, one case per value
              'Dropout' : [0.5]}  # Dropout percentages simulated in each cluster, one case per value
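Once the number of cells is fixed, the cluster percentages translate into per-cluster cell counts. The following sketch assumes that 'C1Percent' and 'C2Percent' are fractions of cells per cluster; `cells_per_cluster` is a hypothetical helper and is not taken from Simulation.py. It checks that the fractions sum to 1 and converts them to counts for 100 cells.

```python
import math


def cells_per_cluster(percentages, n_cells):
    """Hypothetical helper: convert cluster fractions into integer cell counts."""
    if not math.isclose(sum(percentages), 1.0):
        raise ValueError("cluster percentages must sum to 1")
    counts = [int(round(p * n_cells)) for p in percentages]
    # Put any rounding remainder into the last cluster so counts add up to n_cells.
    counts[-1] += n_cells - sum(counts)
    return counts


two_clusters = cells_per_cluster([0.5, 0.5], 100)
three_clusters = cells_per_cluster([0.2, 0.4, 0.4], 100)
print(two_clusters, three_clusters)  # [50, 50] [20, 40, 40]
```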
Simulate the genetic copy number profile for the specified clonal structure, where nGenes is the number of genes and nCells is the number of cells.
Configure = st.GeneticCN(Paramaters, nGenes = 200, nCells = 100)
We simulate the scDNA data from the associated clonal copy number profiles and the transition probability matrix.
DNAmatrix = st.Simulate_DNA(ProbMatrix, Configure)
Simulate the scRNA data based on their associated clonal copy number profiles.
RNAmatrix = st.Simulate_RNA(Configure)
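For intuition about what a dropout percentage means for expression data, the sketch below zeroes out entries of a toy count matrix at a given rate. This is a generic illustration under stated assumptions: `apply_dropout` is a hypothetical helper, the toy matrix is made up, and the way Simulation.py actually applies dropout may differ.

```python
import random


def apply_dropout(matrix, dropout_rate, rng):
    """Hypothetical helper: zero each entry independently with probability dropout_rate."""
    return [[0 if rng.random() < dropout_rate else value for value in row]
            for row in matrix]


rng = random.Random(0)
# Toy expression matrix with 4 rows and 5 columns; all counts are positive.
expression = [[10, 12, 9, 11, 8] for _ in range(4)]
noisy = apply_dropout(expression, dropout_rate=0.5, rng=rng)

# Fraction of entries that were set to zero in this particular draw.
observed_rate = sum(v == 0 for row in noisy for v in row) / 20
print(f"observed dropout rate: {observed_rate:.2f}")
```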
The steps above simulate various copy number fractions in each cluster for a linear structure with 3 clusters, with the outlier percentage and the dropout percentage both fixed at their default of 0.5. To simulate another configuration, such as a bifurcate structure with 3 clusters across various dropout percentages, it is best to fix "Percentage" at a single default value and list the values to vary under "Dropout". For example:
Paramaters = {'Ncluster' : [3],
              'Topology' : ['bifurcate'],
              'C1Percent' : [0.5, 0.5],
              'C2Percent' : [0.2, 0.4, 0.4],
              'Percentage' : [0.5],
              'Outlier' : [0.5],
              'Dropout' : [0.1, 0.2, 0.3, 0.4, 0.5]}
Finally, save each pair of datasets as '.csv' files.
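# Take the dataset pair at index 1 and convert each matrix to a DataFrame
DNA1 = pd.DataFrame(DNAmatrix[1])
RNA1 = pd.DataFrame(RNAmatrix[1])
# Write the pair to CSV without the row index
DNA1.to_csv('DNA1.csv', index=False)
RNA1.to_csv('RNA1.csv', index=False)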
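To confirm that a saved matrix loads back unchanged, a round trip through CSV can be checked as in the sketch below. The small matrix is a hypothetical stand-in for DNAmatrix[1], and an in-memory buffer replaces the file on disk so that the example has no side effects.

```python
import io

import pandas as pd

# Hypothetical stand-in for one simulated matrix.
dna_example = pd.DataFrame([[2, 2, 3], [1, 2, 2]])

buffer = io.StringIO()
dna_example.to_csv(buffer, index=False)  # same option as the save step: no index column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# The integer column labels are written as a header row and come back as strings.
print(reloaded.shape)
print(list(reloaded.columns))
```

Because the header is read back as strings, any code that selects columns by integer label after reloading would need to convert the column names first.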