This notebook demonstrates how to use a Monet model to perform a label transfer between datasets. This requires a "reference" dataset for which labels are already available (usually from a clustering analysis), and a second "target" dataset to which the labels are to be transferred. Usually the reference and target datasets originate from the same tissue type.
Briefly, label transfer works by first projecting both datasets into a shared PC space, defined by a Monet model. Then, a K-nearest-neighbor classifier is trained on the reference data (cells + labels). Finally, the trained classifier is used to predict the cell types of the cells in the target dataset. For more details, see the Monet paper (Wagner, 2020).
For this tutorial, we will use the same two PBMC datasets as in the previous tutorial. These are two datasets generated using two different technologies/chemistries (10x Genomics' Chromium v2 and v3). This example is also shown and discussed in more detail in Figure 4 of the Monet paper (Wagner, 2020).
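The three steps described above (shared PC space, K-nearest-neighbor training, prediction) can be sketched with NumPy alone. This is only an illustration under stated assumptions, not Monet's implementation: the data are synthetic, a plain SVD-based PCA fitted on the reference stands in for the Monet model, and the scaling and transformation steps that Monet applies before projection are omitted. All names (`simulate`, `type_means`, `n_pcs`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_pcs, k = 50, 5, 20

# Two synthetic "cell types" with distinct mean expression profiles
# (made-up data, not the PBMC datasets used in this tutorial)
type_means = {"A": rng.normal(0.0, 1.0, n_genes),
              "B": rng.normal(3.0, 1.0, n_genes)}

def simulate(n_cells):
    """Draw cells of random type and add per-gene noise."""
    labels = rng.choice(["A", "B"], size=n_cells)
    profiles = np.stack([type_means[lab] for lab in labels])
    return profiles + rng.normal(0.0, 0.5, profiles.shape), labels

ref_X, ref_labels = simulate(200)          # reference: cells + labels
target_X, target_truth = simulate(100)     # target: labels used only for checking

# 1) Define the shared PC space from the reference data
#    (assumption: plain PCA as a stand-in for a Monet model)
ref_mean = ref_X.mean(axis=0)
_, _, vt = np.linalg.svd(ref_X - ref_mean, full_matrices=False)
components = vt[:n_pcs].T                  # genes x PCs

# 2) Project both datasets into that space
ref_pcs = (ref_X - ref_mean) @ components
target_pcs = (target_X - ref_mean) @ components

# 3) For each target cell, majority vote among its k nearest reference cells
dists = np.linalg.norm(target_pcs[:, None, :] - ref_pcs[None, :, :], axis=2)
nearest = np.argsort(dists, axis=1)[:, :k]
predicted = []
for row in nearest:
    values, counts = np.unique(ref_labels[row], return_counts=True)
    predicted.append(values[np.argmax(counts)])
predicted = np.array(predicted)

accuracy = (predicted == target_truth).mean()
print(f"accuracy on synthetic data: {accuracy:.2f}")
```

The key design point is that the target cells are projected with the mean and components learned from the reference, so both datasets live in the same coordinate system before any distances are computed.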
# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
/* source: http://stackoverflow.com/a/24207353 */
.container { width:95% !important; }
div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
</style>"""))
from monet import util
_LOGGER = util.configure_logger()
# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)
The label transfer method is provided by Monet in the label_transfer.transfer_knn() function. By default, the K-nearest-neighbor classifier uses K=20. You can change this by passing the num_neighbors parameter to the transfer_knn() function.
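To give some intuition for why the choice of K can matter, here is a small toy example, written from scratch with NumPy rather than with Monet. The helper `knn_predict` and the 2-D points are hypothetical and only illustrate the majority-vote idea: when a cell type has fewer reference cells than half of K, it can be outvoted even for a query cell that sits right inside its cluster.

```python
from collections import Counter

import numpy as np

def knn_predict(ref_points, ref_labels, query, k):
    """Majority vote among the k reference points closest to `query`."""
    dists = np.linalg.norm(ref_points - query, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy reference: 5 cells of a rare type around the origin,
# 100 cells of a common type located farther away
rare = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
common = np.column_stack([np.linspace(2.0, 3.0, 100), np.zeros(100)])
ref_points = np.vstack([rare, common])
ref_labels = ["rare"] * 5 + ["common"] * 100

# A query cell located inside the rare cluster
query = np.array([0.05, 0.05])
for k in (3, 20):
    print(k, knn_predict(ref_points, ref_labels, query, k))
```

With K=3 all votes come from the rare cluster, while with K=20 only 5 of the 20 neighbors can be rare cells, so the common type wins the vote. A smaller value of num_neighbors may therefore be worth trying when the reference contains very small clusters, at the cost of noisier predictions.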
import gc
from monet import ExpMatrix
from monet import MonetModel
from monet import label_transfer
from monet import util
import pandas as pd
monet_model_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
ref_expression_file = 'data/v3_human_pbmc_10k_expression.npz'
target_expression_file = 'data/v2_human_pbmc_8k_expression.npz'
ref_cell_label_file = 'data/v3_human_pbmc_10k_clustering_annotated.tsv'
ref_cell_labels = util.load_cell_labels(ref_cell_label_file)
print(ref_cell_labels.value_counts())
monet_model = MonetModel.load_pickle(monet_model_file)
ref_matrix = ExpMatrix.load_npz(ref_expression_file)
target_matrix = ExpMatrix.load_npz(target_expression_file)
target_cell_labels = label_transfer.transfer_knn(
    monet_model, ref_matrix, ref_cell_labels, target_matrix)
print(target_cell_labels.value_counts())
# free up memory
del ref_matrix, target_matrix; gc.collect()
[2020-06-17 13:02:32] (monet.util.files) INFO: Loaded labels for 10681 cells from tab-delimited plain-text file.
Monocytes                     3345
CD4+ Memory T cells           1847
Naive T cells                 1361
Naive B cells                  956
Other                          888
CD8+/CD161+ Memory T cells     669
NK cells                       563
Memory B cells                 457
CD8+/CD161- Memory T cells     363
mDCs                           151
pDCs                            81
Name: 0, dtype: int64
[2020-06-17 13:02:32] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 13:02:35] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 13:02:38] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 13:02:39] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.00x (on average).
[2020-06-17 13:02:44] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 32.1 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 13:02:44] (monet.latent.pca_model) WARNING: No expression data for 1153 / 15510 genes (7.4 %) in the PCA model.
[2020-06-17 13:02:45] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.57x (on average).
[2020-06-17 13:02:49] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 20.8 % of the total variance in the scaled and FT-transformed data.
Naive T cells                 2116
Monocytes                     1670
CD4+ Memory T cells           1405
CD8+/CD161- Memory T cells     791
Naive B cells                  781
Other                          557
Memory B cells                 370
NK cells                       280
mDCs                           188
CD8+/CD161+ Memory T cells     161
pDCs                            62
dtype: int64
24
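Because the two datasets contain different numbers of cells (10681 in the reference, 8381 in the target), the raw counts printed above are hard to compare directly. A minimal sketch for converting them to percentages follows; the counts are copied by hand from the output above, and the variable names are hypothetical. When working with the actual pandas Series objects, calling value_counts(normalize=True) should give the same fractions without retyping anything.

```python
# Label counts copied from the printed output above
ref_counts = {
    "Monocytes": 3345, "CD4+ Memory T cells": 1847, "Naive T cells": 1361,
    "Naive B cells": 956, "Other": 888, "CD8+/CD161+ Memory T cells": 669,
    "NK cells": 563, "Memory B cells": 457, "CD8+/CD161- Memory T cells": 363,
    "mDCs": 151, "pDCs": 81,
}
target_counts = {
    "Naive T cells": 2116, "Monocytes": 1670, "CD4+ Memory T cells": 1405,
    "CD8+/CD161- Memory T cells": 791, "Naive B cells": 781, "Other": 557,
    "Memory B cells": 370, "NK cells": 280, "mDCs": 188,
    "CD8+/CD161+ Memory T cells": 161, "pDCs": 62,
}

ref_total = sum(ref_counts.values())
target_total = sum(target_counts.values())

# Print each label's share of cells in the reference and in the target
print(f"{'label':<28}{'reference':>10}{'target':>10}")
for label in ref_counts:
    ref_pct = 100 * ref_counts[label] / ref_total
    target_pct = 100 * target_counts[label] / target_total
    print(f"{label:<28}{ref_pct:>9.1f}%{target_pct:>9.1f}%")
```

Large differences in these proportions do not necessarily indicate a failed transfer, since the two samples may genuinely differ in composition, but they can point to labels that deserve a closer look in the visualization.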
import gc
from monet import ExpMatrix
from monet import visualize
expression_file = 'data/v2_human_pbmc_8k_expression.npz'
matrix = ExpMatrix.load_npz(expression_file)
cluster_order = [
    'Naive T cells',
    'CD4+ Memory T cells',
    'CD8+/CD161- Memory T cells',
    'CD8+/CD161+ Memory T cells',
    'NK cells',
    'Naive B cells',
    'Memory B cells',
    'Monocytes',
    'mDCs',
    'pDCs',
    'Other',
]
cluster_colors = {
    'Other': 'lightgray',
}
fig, tsne_scores = visualize.tsne_plot(
    matrix, num_components=30,
    cell_labels=target_cell_labels,
    cluster_order=cluster_order,
    cluster_colors=cluster_colors,
    width=1200)
fig.show()
print(target_cell_labels.value_counts())
# free up memory
del matrix; gc.collect()
[2020-06-17 13:02:53] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 13:02:53] (root) INFO: No Monet model provided, performing PCA to determine first 30principal components...
[2020-06-17 13:02:53] (monet.latent.pca_model) INFO: Converted matrix to float32 data type.
[2020-06-17 13:02:58] (monet.latent.pca_model) INFO: The PCA took 1.4 s.
[2020-06-17 13:02:58] (monet.latent.pca_model) INFO: The fraction of variance explained by the 30 selected PCs is 25.5 %.
[2020-06-17 13:02:58] (root) INFO: Performing t-SNE...
[2020-06-17 13:03:18] (root) INFO: t-SNE took 20.2 s.
Naive T cells                 2116
Monocytes                     1670
CD4+ Memory T cells           1405
CD8+/CD161- Memory T cells     791
Naive B cells                  781
Other                          557
Memory B cells                 370
NK cells                       280
mDCs                           188
CD8+/CD161+ Memory T cells     161
pDCs                            62
dtype: int64
8251