The pipeline consists of four successive steps: data pre-processing, cellular clustering and pseudo-temporal ordering, determining differential expressed genes and identifying biomarkers.
library(DIscBIO)
Loading required package: SingleCellExperiment Loading required package: SummarizedExperiment Loading required package: MatrixGenerics Loading required package: matrixStats Attaching package: ‘MatrixGenerics’ The following objects are masked from ‘package:matrixStats’: colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars Loading required package: GenomicRanges Loading required package: stats4 Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following object is masked from ‘package:base’: expand.grid Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘Biobase’ The following object is masked from ‘package:MatrixGenerics’: rowMedians The following objects are masked from ‘package:matrixStats’: anyMissing, rowMedians
The "CTCdataset" dataset consisting of single migratory circulating tumor cells (CTCs) collected from patients with breast cancer. Data are available in the GEO database with accession numbers GSE51827, GSE55807, GSE67939, GSE75367, GSE109761, GSE111065 and GSE86978. The dataset should be formatted in a data frame where columns refer to samples and rows refer to genes. We provide here the possibility to load the dataset either as ".csv" or ".rda" extensions.
FileName<-"CTCdataset" # Name of the dataset
#CSV=TRUE # If the dataset has ".csv", the user shoud set CSV to TRUE
CSV=FALSE # If the dataset has ".rda", the user shoud set CSV to FALSE
if (CSV==TRUE){
DataSet <- read.csv(file = paste0(FileName,".csv"), sep = ",",header=T)
rownames(DataSet)<-DataSet[,1]
DataSet<-DataSet[,-1]
} else{
load(paste0(FileName,".rda"))
DataSet<-get(FileName)
}
cat(paste0("The ", FileName," contains:","\n","Genes: ",length(DataSet[,1]),"\n","cells: ",length(DataSet[1,]),"\n"))
The CTCdataset contains: Genes: 13181 cells: 1462
sc<- DISCBIO(DataSet) # The DISCBIO class is the central object storing all information generated throughout the pipeline
Prior to applying data analysis methods, it is standard to pre-process the raw read counts resulted from the sequencing. The preprocessing approach depends on the existence or absence of ERCC spike-ins. In both cases, it includes normalization of read counts and gene filtering.
To account for RNA composition and sequencing depth among samples (single-cells), the normalization method “median of ratios” is used. This method takes the ratio of the gene instantaneous median to the total counts for all genes in that cell (column median). The gene instantaneous median is the product of multiplying the median of the total counts across all cells (row median) with the read of the target gene in each cell. This normalization method makes it possible to compare the normalized counts for each gene equally between samples.
The key idea in filtering genes is to appoint the genes that manifest abundant variation across samples. Filtering genes is a critical step due to its dramatic impact on the downstream analysis. In case the raw data includes ERCC spike-ins, genes will be filtered based on variability in comparison to a noise level estimated from the ERCC spike-ins according to an algorithm developed by Brennecke et al (Brennecke et al., 2013). This algorithm utilizes the dependence of technical noise on the average read count and fits a model to the ERCC spike-ins. Further gene filtering can be implemented based on gene expression. In case the raw data does not include ERCC spike-ins, genes will be only filtered based on minimum expression in certain number of cells.
To normalize and filter the raw data that does not include ERCCs can be done by applying the function Normalizedata() and giving the parameters minexpr and minnumber some values. This function will discard cells with less than mintotal transcripts. Genes that are not expressed at minexpr transcripts in at least minnumber cells are discarded. Furthermore, it will normalize the count reads using the normalization method “median of ratios”.
To Finalize the preprocessing the function FinalPreprocessing() should be implemented by setting the parameter "GeneFlitering" to ExpF.
# Estimating a value for the "mintotal" parameters
# As a common practice, mintotal is set to 1000
S1<-summary(colSums(DataSet,na.rm=TRUE)) # It gives an idea about the number of reads across cells
print(S1)
Min. 1st Qu. Median Mean 3rd Qu. Max. 844 1916 4070 436014 38916 12865982
# Estimating a value for the "minexpr" parameter
S2<-summary(rowMeans(DataSet,na.rm=TRUE)) # It gives an idea about the overall expression of the genes
print(S2)
minexpr= S2[3] # S2[3] is referring to the median whereas S2[4] is referring to the mean
Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 1.38 7.14 33.08 21.91 34476.42
# Estimating a value for the "minnumber" parameters
minnumber= round(length(DataSet[1,])/10) # To be expressed in at 10% of the cells.
print(minnumber)
[1] 146
sc<-Normalizedata(sc, mintotal=1000, minexpr=minexpr, minnumber=minnumber, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000)
sc<-FinalPreprocessing(sc,GeneFlitering="ExpF",export = TRUE) # The GeneFiltering should be set to "ExpF"
The gene filtering method = Noise filtering The Filtered Normalized dataset contains: Genes: 215 cells: 1448 The Filtered Normalized dataset was saved as: filteredDataset.Rdata
Cellular clustering is performed according to the gene expression profiles to detect cellular sub-population with unique properties. After clustering, pseudo-temporal ordering is generated to indicate the cellular differentiation degree.
Rare cell type Identification algorithm (RaceID) was used to cluster the pre-processed data using k-means on a similarity distance matrix, which was based on Pearson correlation and the similarity matrix was computed as “1 – Pearson correlation”. The approach of the proposed clustering, i.e., applying k-means on a similarity distance matrix using the Euclidean metric, improves cluster separation. RaceID estimates the number of clusters by finding the minimal clusters' number at the saturation level of gap statistics, which standardizes the within-cluster dispersion.
The Clustexp() functions has several parameters:
TRUE
, intermediate output is suppressedload("SC.RData") # Loading the "SC" object that has include the data of the k-means clustering
load("Ndata.RData") # Loading the "Ndata" object and stored in the @ndata will be used to plot the expression of genes
load("expdata.RData") # Loading the "expdata" object and stored in the @expdata will be used to plot the expression of genes
sc<-SC # Storing the data of SC in the sc
sc@ndata<-Ndata
sc@expdata<-expdata
########## Removing the unneeded objects
rm(Ndata)
rm(expdata)
rm(DataSet)
rm(SC)
#sc<- Clustexp(sc,cln=4,quiet=T,clustnr=6,rseed=17000)
plotGap(sc) ### Plotting gap statistics
Here you visualize the K-means clusters using t-distributed stochastic neighbor embedding (tSNE), which is a non-linear dimensionality reduction method that places neighbor cells close to each other.
############ Plotting the clusters
plottSNE(sc)
sc<-pseudoTimeOrdering(sc,quiet = TRUE, export = FALSE)
plotOrderTsne(sc)
g='ENSG00000104413' #### Plotting the log expression of ESRP1
plotExptSNE(sc,g)
g='ENSG00000251562' #### Plotting the log expression of MALAT1
plotExptSNE(sc,g)