This notebook demonstrates how to use Monet to plot a batch-corrected t-SNE, using batch correction based on identifying mutual nearest neighbors (MNNs), as described by Haghverdi et al. (2018). Monet implements a slightly modified version of the original method, in which the identification of MNNs and the batch correction are performed in PC space, rather than in gene space.
The example used here is also shown and discussed in more detail in Figure 3 of the Monet paper (Wagner, 2020).
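To make the idea of mutual nearest neighbors concrete, the following is a minimal NumPy sketch on toy data. It is not Monet's implementation; the function name `find_mnn_pairs`, the parameter `k`, and the toy coordinates are assumptions made for illustration. A reference cell and a target cell form an MNN pair when each is among the other's `k` nearest neighbors in the other dataset.

```python
import numpy as np

def find_mnn_pairs(ref_scores, target_scores, k=1):
    """Return sorted (ref_index, target_index) pairs that are mutual nearest neighbors.

    Hypothetical helper for illustration only; uses brute-force distances.
    """
    # pairwise squared Euclidean distances, shape (n_ref, n_target)
    diff = ref_scores[:, None, :] - target_scores[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # for each reference cell, the indices of its k nearest target cells
    ref_to_target = np.argsort(dist, axis=1)[:, :k]
    # for each target cell, the indices of its k nearest reference cells
    target_to_ref = np.argsort(dist, axis=0)[:k, :].T
    pairs = []
    for i in range(ref_scores.shape[0]):
        for j in ref_to_target[i]:
            # keep the pair only if the relationship holds in both directions
            if i in target_to_ref[j]:
                pairs.append((int(i), int(j)))
    return sorted(pairs)

# toy "PC scores": three reference cells and two target cells
ref_scores = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
target_scores = np.array([[1.0, 0.0], [11.0, 0.0]])

pairs = find_mnn_pairs(ref_scores, target_scores, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

The third reference cell has the second target cell as its nearest neighbor, but not vice versa, so it does not contribute a pair. For datasets with thousands of cells, a brute-force distance matrix like this one would likely be replaced by a dedicated nearest-neighbor search.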
# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
/* source: http://stackoverflow.com/a/24207353 */
.container { width:95% !important; }
div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
</style>"""))
from monet import util
_LOGGER = util.configure_logger()
# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)
One way of performing a t-SNE on two or more datasets is to project all datasets into a shared PC space defined by a Monet model. Since the Monet model was only trained on one of the datasets (the "reference" dataset), there is no risk that this PC space explicitly represents any batch effects. However, depending on the strength of the batch effects present, they can still manifest themselves after projection into this PC space. We illustrate this by combining two PBMC datasets generated with different technologies/chemistries (Chromium v2 and v3).
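The projection step can be pictured with a short NumPy sketch. This is plain PCA on random toy data, under the assumption that both datasets share the same genes; it omits the scaling and transformation steps that Monet applies, and the names `project`, `components`, and `ref_mean` are hypothetical. The key point is that the principal components are fitted on the reference data only and then reused for the target data.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy expression matrices (cells x genes); the constant offset added to the
# target data is a crude stand-in for a batch effect
ref = rng.normal(size=(100, 20))
target = rng.normal(size=(80, 20)) + 0.5

# fit the PC space on the reference dataset only
ref_mean = ref.mean(axis=0)
_, _, vt = np.linalg.svd(ref - ref_mean, full_matrices=False)
components = vt[:5]  # first 5 principal axes, shape (5, 20)

def project(matrix):
    """Project cells into the PC space defined by the reference data."""
    return (matrix - ref_mean) @ components.T

ref_scores = project(ref)
target_scores = project(target)

# both datasets now live in the same 5-dimensional space
combined_scores = np.vstack([ref_scores, target_scores])
print(combined_scores.shape)  # (180, 5)
```

Because the target cells are centered with the reference mean, any systematic offset between the datasets is carried into the projected scores, which is how batch effects can survive the projection.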
We obtain the t-SNE plot with the tsne_plot() function. Since we are plotting almost 20,000 cells, we plot an exaggerated t-SNE using the parameter exaggerated_tsne=True. (See the Monet tutorial on visualizing data with t-SNE for more details.) We also use the parameter random_order=True in order to plot cells in random order. This greatly reduces occurrences where cells from one dataset completely occlude those from the other dataset.
The blue cells represent the reference dataset (Chromium v3), and the red cells represent the target dataset (Chromium v2). As you can see, there is almost no overlap between the two datasets, which is due to the strong batch effects present.
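The effect of plotting cells in random order can be sketched as follows. This is an illustration of the general idea with NumPy, not of how Monet implements random_order internally; the variable names and toy coordinates are assumptions. Without shuffling, all cells of the second dataset would be drawn after (and therefore on top of) those of the first.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy t-SNE coordinates and dataset labels for ten cells
coords = rng.normal(size=(10, 2))
labels = np.array(['Reference'] * 5 + ['Target'] * 5)

# draw a random plotting order and apply it to coordinates and labels alike,
# so that each cell keeps its own label
order = rng.permutation(len(labels))
shuffled_coords = coords[order]
shuffled_labels = labels[order]
```

Passing the shuffled arrays to a scatter plot as a single trace interleaves the two datasets in drawing order, so neither one systematically covers the other.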
import gc
from monet.core import ExpMatrix
from monet.latent import MonetModel
from monet import visualize
from monet.visualize import ACCESSIBLE_COLORS
from monet import util
monet_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
ref_expression_file = 'data/v3_human_pbmc_10k_expression.npz'
target_expression_file = 'data/v2_human_pbmc_8k_expression.npz'
monet_model = MonetModel.load_pickle(monet_file)
ref_matrix = ExpMatrix.load_npz(ref_expression_file)
target_matrix = ExpMatrix.load_npz(target_expression_file)
datasets = {
    'Reference': ref_matrix,
    'Target': target_matrix,
}
combined_matrix, cell_labels = util.combine_matrices(datasets)
cluster_order = ['Reference', 'Target']
cluster_colors = {
    'Reference': ACCESSIBLE_COLORS[3],
    'Target': ACCESSIBLE_COLORS[2],
}
fig, tsne_scores = visualize.tsne_plot(
    combined_matrix, monet_model=monet_model,
    exaggerated_tsne=True,
    random_order=True,
    cell_labels=cell_labels,
    cluster_order=cluster_order,
    cluster_colors=cluster_colors,
    marker_size=2.5)
fig.show()
del ref_matrix, target_matrix, combined_matrix
gc.collect()
[2020-06-17 12:02:52] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 12:02:56] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 12:02:59] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 12:03:01] (root) INFO: Using Monet model to project data onto a 30-dimensional latent space...
[2020-06-17 12:03:04] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.22x (on average).
[2020-06-17 12:03:11] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 27.5 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 12:03:11] (root) INFO: Performing exaggerated t-SNE...
[2020-06-17 12:03:28] (root) INFO: t-SNE took 17.3 s.
22989
For batch-corrected t-SNE, Monet provides the function visualize.batch_corrected_tsne_plot(). It currently only supports the integration of two datasets (reference and target). Aside from the t-SNE scores, it also returns the batch-corrected PC scores that formed the basis for the t-SNE.
As the log output shows, the batch correction procedure for this task, involving almost 20,000 cells, took approximately 48 seconds. The blue cells again represent the reference dataset (Chromium v3), and the red cells represent the target dataset (Chromium v2). Almost all clusters now consist of a mixture of cells from the reference and target datasets, indicating that batch effects have been greatly reduced.
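How MNN pairs translate into a correction of the target PC scores can be sketched with NumPy. This follows the general idea of Haghverdi et al. (2018), in which each MNN pair yields a correction vector and each target cell receives a Gaussian-weighted average of those vectors. It is a simplified illustration, not Monet's implementation; the function name `mnn_correct`, the bandwidth parameter `sigma`, and the toy data are assumptions.

```python
import numpy as np

def mnn_correct(ref_scores, target_scores, pairs, sigma=1.0):
    """Shift target cells toward the reference using MNN pair vectors.

    Hypothetical helper for illustration only.
    """
    ref_idx = np.array([i for i, _ in pairs])
    target_idx = np.array([j for _, j in pairs])
    # one correction vector per MNN pair: from the target cell to its reference partner
    pair_vectors = ref_scores[ref_idx] - target_scores[target_idx]
    # squared distance from every target cell to the target cell of each pair
    diff = target_scores[:, None, :] - target_scores[target_idx][None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # subtract the row minimum before exponentiating to avoid all-zero weights
    sq_dist = sq_dist - sq_dist.min(axis=1, keepdims=True)
    weights = np.exp(-sq_dist / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    # each target cell is moved by a weighted average of the pair vectors
    return target_scores + weights @ pair_vectors

# toy example: the target data is the reference data shifted by a constant
ref_scores = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
target_scores = ref_scores + np.array([5.0, 5.0])
pairs = [(0, 0), (1, 1), (2, 2)]  # MNN pairs, defined by hand here

corrected = mnn_correct(ref_scores, target_scores, pairs)
print(np.allclose(corrected, ref_scores))  # True
```

In this toy case the batch effect is a pure shift, so every pair vector is identical and the correction recovers the reference coordinates exactly. With real data the pair vectors differ between cell populations, which is why the correction is smoothed locally rather than applied as one global shift.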
import gc
import sys
import time
from monet.core import ExpMatrix
from monet.latent import MonetModel
from monet.visualize import batch_corrected_tsne_plot
monet_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
ref_expression_file = 'data/v3_human_pbmc_10k_expression.npz'
target_expression_file = 'data/v2_human_pbmc_8k_expression.npz'
# load the Monet model
monet_model = MonetModel.load_pickle(monet_file)
# load the expression matrices
ref_matrix = ExpMatrix.load_npz(ref_expression_file)
target_matrix = ExpMatrix.load_npz(target_expression_file)
fig, legend_fig, tsne_scores, pc_scores = batch_corrected_tsne_plot(
    monet_model, ref_matrix, target_matrix)
fig.show()
# free up memory
del monet_model, ref_matrix, target_matrix
gc.collect()
[2020-06-17 12:03:29] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 12:03:33] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 12:03:35] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 12:03:35] (monet.batch_correct.mnn) INFO: Determining all MNN pairs...
[2020-06-17 12:03:36] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.00x (on average).
[2020-06-17 12:03:40] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 32.1 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 12:03:40] (monet.latent.pca_model) WARNING: No expression data for 1153 / 15510 genes (7.4 %) in the PCA model.
[2020-06-17 12:03:41] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.57x (on average).
[2020-06-17 12:03:44] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 20.8 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 12:04:18] (monet.batch_correct.mnn) INFO: Calculating batch correction vectors for all MNN pairs...
[2020-06-17 12:04:23] (monet.batch_correct.mnn) INFO: Applying batch correction to target PC scores...
[2020-06-17 12:04:24] (monet.batch_correct.mnn) INFO: Batch correction using mutual nearest neighbors took 48.3 s.
[2020-06-17 12:04:24] (root) INFO: Performing exaggerated t-SNE...
[2020-06-17 12:04:41] (root) INFO: t-SNE took 17.5 s.
22990
To obtain a figure legend, batch_corrected_tsne_plot() returns a second figure object specifically for that purpose.
# enlarge the legend figure and its markers for readability
legend_fig.layout.width = 900
legend_fig.data[0].marker.size = 20
legend_fig.data[1].marker.size = 20
legend_fig.show()