This notebook demonstrates how to use Monet to denoise scRNA-Seq data with ENHANCE (Wagner, 2019). The ENHANCE algorithm separates biological heterogeneity and technical noise using PCA. By performing PCA after aggregating data from neighboring cells, bias against lowly expressed genes is reduced. The number of dimensions (PCs) is determined using a heuristic that involves simulating a "negative control" dataset which only contains technical noise.
# change notebook width and font
from IPython.core.display import HTML, display
display(HTML("""<style>
/* source: http://stackoverflow.com/a/24207353 */
.container { width:95% !important; }
div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
</style>"""))
from monet import util
_LOGGER = util.configure_logger()
# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)
In Monet, ENHANCE is performed by calling the fit()
method of ENHANCE
objects. This will result in a trained model, with the results being stored inside the ENHANCE
object.
To save the ENHANCE result to disk, we generally want to avoid saving the denoised expression matrix directly. The reason for that is that the denoised matrix is no longer sparse, and saving it to disk can easily require several hundred MB. Thankfully, it is not necessary to save the full matrix. Since the denoised data is the result of a (truncated) PCA, we can simply save the low-dimensional PCA-based representation of the data. This is accomplished by saving the ENHANCE
object using save_pickle()
. To load ENHANCE results from disk, we then call ENHANCE.load_pickle()
.
Once we have a trained ENHANCE model, we can access the denoised expression matrix using the denoised_matrix_
attribute of the ENHANCE
object. This will reconstruct the denoised expression matrix from the low-dimensional representation stored inside the object.
import gc
from monet import ExpMatrix
from monet import ENHANCE
expression_file = 'data/v3_human_pbmc_10k_expression.npz'
enhance_model_file = 'output/v3_human_pbmc_10k_enhance_model.pickle'
# load data
matrix = ExpMatrix.load_npz(expression_file)
# train ENHANCE model
enhance_model = ENHANCE()
enhance_model.fit(matrix)
# save ENHANCE model to disk
enhance_model.save_pickle(enhance_model_file)
# free up memory
del matrix; gc.collect()
[2020-06-17 07:31:14] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a). [2020-06-17 07:31:14] (monet.denoise.enhance) INFO: Beginning of Phase I (Dimensionality estimation)... [2020-06-17 07:31:14] (monet.denoise.util) INFO: Estimating the number of PCs with var_fold_thresh=2.00 [2020-06-17 07:31:23] (monet.denoise.util) INFO: Performing PCA on pure noise matrix... [2020-06-17 07:31:23] (monet.latent.pca_model) INFO: Converted matrix to float32 data type. [2020-06-17 07:31:29] (monet.latent.pca_model) INFO: The PCA took 2.2 s. [2020-06-17 07:31:29] (monet.latent.pca_model) INFO: The fraction of variance explained by the 100 selected PCs is 5.1 %. [2020-06-17 07:31:29] (monet.denoise.util) INFO: Variance threshold: 4.015 [2020-06-17 07:31:29] (monet.latent.pca_model) INFO: Converted matrix to float32 data type. [2020-06-17 07:31:35] (monet.latent.pca_model) INFO: The PCA took 2.2 s. [2020-06-17 07:31:35] (monet.latent.pca_model) INFO: The fraction of variance explained by the 100 selected PCs is 36.7 %. [2020-06-17 07:31:35] (monet.denoise.util) INFO: The estimated number of PCs is 24. [2020-06-17 07:31:35] (monet.denoise.enhance) INFO: Completed Phase I (Dimensionality estimation). [2020-06-17 07:31:35] (monet.denoise.enhance) INFO: Beginning of Phase II (Aggregation step)... [2020-06-17 07:31:35] (monet.denoise.enhance) INFO: The median transcript count is 5783.0. [2020-06-17 07:31:35] (monet.denoise.enhance) INFO: Will perform denoising with k=35 (value was determined automatically based on a target transcript count of 200000). [2020-06-17 07:31:35] (monet.denoise.enhance) INFO: Inferring a 24-dimensional latent space... [2020-06-17 07:31:36] (monet.latent.monet_model) INFO: Beginning of Phase I (Estimate dimensionality)... [2020-06-17 07:31:36] (monet.latent.monet_model) INFO: Will use 24 PCs (value was pre-specified). [2020-06-17 07:31:36] (monet.latent.monet_model) INFO: Phase I (Estimating dimensionality) took 0.0 s. [2020-06-17 07:31:36] (monet.latent.monet_model) INFO: Beginning of Phase II (Latent space inference)... [2020-06-17 07:31:36] (monet.latent.monet_model) INFO: Learning the latent space... [2020-06-17 07:31:44] (monet.latent.pca_model) INFO: The PCA took 1.5 s. [2020-06-17 07:31:45] (monet.latent.pca_model) INFO: The fraction of variance explained by the 24 selected PCs is 32.9 %. [2020-06-17 07:31:45] (monet.latent.monet_model) INFO: The median transcript count is 5783.0. [2020-06-17 07:31:45] (monet.latent.monet_model) INFO: Will perform kNN-aggregation with num_neighbors=35 (value was pre-specified). [2020-06-17 07:31:45] (monet.latent.monet_model) INFO: Now performing aggregation step 1/1... [2020-06-17 07:31:45] (monet.latent.util) INFO: Calculating the pairwise distances took 0.7 s. [2020-06-17 07:31:51] (monet.latent.util) INFO: Sorting the pairwise distance matrix took 5.9 s. [2020-06-17 07:31:54] (monet.latent.util) INFO: Aggregating the expression values took 2.3 s. [2020-06-17 07:31:59] (monet.latent.pca_model) INFO: The PCA took 1.3 s. [2020-06-17 07:31:59] (monet.latent.pca_model) INFO: The fraction of variance explained by the 24 selected PCs is 90.4 %. [2020-06-17 07:31:59] (monet.latent.monet_model) INFO: Learned a 24-dimensional latent space in 23.4 s. [2020-06-17 07:31:59] (monet.latent.monet_model) INFO: Phase II (Latent space inference) took 23.4 s. [2020-06-17 07:31:59] (monet.latent.monet_model) INFO: Fitting the Monet model took 23.5 s (0.4 min). [2020-06-17 07:32:01] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.00x (on average). [2020-06-17 07:32:05] (monet.latent.pca_model) INFO: Projection onto 24 PCs retained 31.8 % of the total variance in the scaled and FT-transformed data. [2020-06-17 07:32:05] (monet.denoise.enhance) INFO: Performing nearest-neighbor aggregation with 35 neighbors... [2020-06-17 07:32:06] (monet.denoise.util) INFO: Calculating the pairwise distances took 0.7 s. [2020-06-17 07:32:11] (monet.denoise.util) INFO: Sorting the pairwise distance matrix took 5.7 s. [2020-06-17 07:32:14] (monet.denoise.util) INFO: Aggregating the expression values took 2.6 s. [2020-06-17 07:32:14] (monet.denoise.util) INFO: Nearest-neighbor aggregation took 9.2 s. [2020-06-17 07:32:14] (monet.denoise.enhance) INFO: Completed Phase II (Aggregation step). [2020-06-17 07:32:14] (monet.denoise.enhance) INFO: Beginning of Phase III (PCA step)... [2020-06-17 07:32:14] (monet.latent.pca_model) INFO: Converted matrix to float32 data type. [2020-06-17 07:32:21] (monet.latent.pca_model) INFO: The PCA took 1.3 s. [2020-06-17 07:32:21] (monet.latent.pca_model) INFO: The fraction of variance explained by the 24 selected PCs is 86.9 %. [2020-06-17 07:32:28] (monet.denoise.enhance) INFO: Completed Phase III (PCA step) in 14.3 s. [2020-06-17 07:32:28] (monet.denoise.enhance) INFO: Denoising with ENHANCE took 74.6 s (1.2 min). [2020-06-17 07:32:29] (monet.denoise.enhance) INFO: Saved denoising model to pickle file "output/v3_human_pbmc_10k_enhance_model.pickle".
0
Note that saving the ENHANCE result as a pickle file is far more space-efficient than saving the denoised matrix, which in this case would require about 560 MB.
!du -h $enhance_model_file
11M output/v3_human_pbmc_10k_enhance_model.pickle
We will now overlay both raw and denoised expression values on a t-SNE plot of the data. First, we perform t-SNE on the raw data, as shown in previous tutorials.
import gc
from monet import ExpMatrix
from monet import visualize
expression_file = 'data/v3_human_pbmc_10k_expression.npz'
matrix = ExpMatrix.load_npz(expression_file)
fig, tsne_scores = visualize.tsne_plot(matrix)
# free up memory
del matrix; gc.collect()
[2020-06-17 09:25:56] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a). [2020-06-17 09:25:56] (root) INFO: No Monet model provided, performing PCA to determine first 30principal components... [2020-06-17 09:25:56] (monet.latent.pca_model) INFO: Converted matrix to float32 data type. [2020-06-17 09:26:03] (monet.latent.pca_model) INFO: The PCA took 1.3 s. [2020-06-17 09:26:03] (monet.latent.pca_model) INFO: The fraction of variance explained by the 30 selected PCs is 33.4 %. [2020-06-17 09:26:03] (root) INFO: Performing t-SNE... [2020-06-17 09:26:26] (root) INFO: t-SNE took 22.4 s.
3541
Next, we'll visualize the raw data for a gene called CCR7, which is a naive T cell marker.
import gc
from monet import ExpMatrix
from monet import visualize
from monet import util
expression_file = 'data/v3_human_pbmc_10k_expression.npz'
sel_gene = 'CCR7'
matrix = ExpMatrix.load_npz(expression_file)
# make sure matrix is scaled to the median transcript count
matrix = util.scale(matrix)
# get expression values for CCR7
profile = matrix.loc[sel_gene].copy()
# calculate the range of expression values to show in the figure (mean +/ 2 std dev.)
mean = profile.mean()
std = profile.std(ddof=1)
emin = mean - 2*std
emax = mean + 2*std
# plot the t-SNE
fig = visualize.plot_cells(
tsne_scores,
profile=profile,
emin=emin, emax=emax,
colorbar_label='Transcripts',
title='Gene: <i>%s</i> (raw data)' % sel_gene)
fig.show()
# free up memory
del matrix; gc.collect()
[2020-06-17 09:34:32] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
18471
Next, we'll visualize the expression of CCR7 in the denoised data.
import gc
from monet import ENHANCE
from monet import visualize
from monet import util
enhance_model_file = 'data/v3_human_pbmc_10k_enhance_model.pickle'
# uncomment the following line if you are running these notebooks on your computer,
# and want to use the result you generated yourself
#enhance_model_file = 'output/v3_human_pbmc_10k_enhance_model.pickle'
sel_gene = 'CCR7'
enhance_model = ENHANCE.load_pickle(enhance_model_file)
matrix = enhance_model.denoised_matrix_
# make sure matrix is scaled to the median transcript count
matrix = util.scale(matrix)
# get expression values for CCR7
profile = matrix.loc[sel_gene].copy()
# calculate the range of expression values to show in the figure (mean +/ 2 std dev.)
mean = profile.mean()
std = profile.std(ddof=1)
emin = mean - 2*std
emax = mean + 2*std
# plot the t-SNE
fig = visualize.plot_cells(
tsne_scores,
profile=profile,
emin=emin, emax=emax,
colorbar_label='Transcripts',
title='Gene: <i>%s</i> (denoised data)' % sel_gene)
fig.show()
# free up memory
del matrix, enhance_model; gc.collect()
[2020-06-17 09:35:07] (monet.denoise.enhance) INFO: Loaded denoising model from pickle file "data/v3_human_pbmc_10k_enhance_model.pickle".
18467