
Overview

This notebook demonstrates how to use a Monet model to transfer labels between datasets. This requires a "reference" dataset for which labels are already available (usually from a clustering analysis), and a "target" dataset to which the labels are to be transferred. Usually the reference and target datasets originate from the same tissue type.

Briefly, the label transfer works in three steps. First, both datasets are projected into a shared PC space, defined by a Monet model. Then, a K-nearest-neighbor classifier is trained on the reference data (cells + labels). Finally, the trained classifier is used to predict the cell types of the cells in the target dataset. For more details, see the Monet paper (Wagner, 2020).
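The three steps above can be illustrated with a minimal NumPy sketch on synthetic data. This is not Monet's implementation: the helper names (fit_pca, project, knn_predict) are hypothetical, the PCA is fitted directly on the toy reference data rather than loaded from a Monet model, and the scaling and FT-transformation that Monet reports in its log output are omitted.

```python
import numpy as np

def fit_pca(X, num_components):
    """Fit a PCA on the rows of X; return the mean and the principal axes."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:num_components]

def project(X, mean, components):
    """Project the rows of X into the PC space defined by (mean, components)."""
    return (X - mean) @ components.T

def knn_predict(ref_scores, ref_labels, target_scores, num_neighbors=20):
    """Assign each target cell the majority label of its nearest reference cells."""
    ref_labels = np.asarray(ref_labels)
    predictions = []
    for x in target_scores:
        dist = np.linalg.norm(ref_scores - x, axis=1)
        nearest = np.argsort(dist)[:num_neighbors]
        values, counts = np.unique(ref_labels[nearest], return_counts=True)
        # ties are broken in favor of the alphabetically first label
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

# toy data: two well-separated "cell types" measured on five "genes"
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [6.0] * 5])
ref = np.vstack([rng.normal(c, 1.0, size=(50, 5)) for c in centers])
ref_labels = np.repeat(['A', 'B'], 50)
target = np.vstack([rng.normal(c, 1.0, size=(20, 5)) for c in centers])
target_truth = np.repeat(['A', 'B'], 20)

# step 1: project both datasets into the same PC space
mean, components = fit_pca(ref, num_components=2)
ref_scores = project(ref, mean, components)
target_scores = project(target, mean, components)

# steps 2 and 3: use the labeled reference cells to classify the target cells
predicted = knn_predict(ref_scores, ref_labels, target_scores, num_neighbors=20)
accuracy = (predicted == target_truth).mean()
print(f'Fraction of toy target cells labeled correctly: {accuracy:.2f}')
```

The key design point is that the target data is projected with the mean and axes learned from the reference, so that distances between reference and target cells are measured in one common coordinate system.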

For this tutorial, we will use the same two PBMC datasets as in the previous tutorial. These are two datasets generated using two different technologies/chemistries (10x Genomics' Chromium v2 and v3). This example is also shown and discussed in more detail in Figure 4 of the Monet paper (Wagner, 2020).

Set up the notebook

In [1]:

# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
    /* source: http://stackoverflow.com/a/24207353 */
    .container { width:95% !important; }
    div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
    </style>"""))

from monet import util

_LOGGER = util.configure_logger()

# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)

Perform label transfer

Monet provides the label transfer method in the label_transfer.transfer_knn() function. By default, the K-nearest-neighbor classifier uses K=20. To change this, pass the num_neighbors parameter to transfer_knn().
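To judge how sensitive the transferred labels are to the choice of K, one option is to repeat the transfer for several values and count how many cells change label relative to the default. The sketch below is only an illustration: fraction_changed and toy_transfer are hypothetical names, and the toy function merely stands in for a real transfer so that the example runs without Monet or the data files.

```python
def fraction_changed(transfer_fn, k_values, baseline_k=20):
    """For each K, return the fraction of cells labeled differently than with baseline_k.

    transfer_fn(k) must return one label per target cell, in the same
    cell order for every call.
    """
    baseline = list(transfer_fn(baseline_k))
    changed = {}
    for k in k_values:
        labels = list(transfer_fn(k))
        num_diff = sum(a != b for a, b in zip(labels, baseline))
        changed[k] = num_diff / len(baseline)
    return changed

def toy_transfer(k):
    """Hypothetical stand-in: pretend a very large K relabels one rare cell."""
    labels = ['T cell', 'T cell', 'B cell', 'pDC']
    if k >= 100:
        labels[3] = 'B cell'
    return labels

result = fraction_changed(toy_transfer, [5, 20, 100])
print(result)
```

With Monet, transfer_fn could be a small wrapper such as `lambda k: label_transfer.transfer_knn(monet_model, ref_matrix, ref_cell_labels, target_matrix, num_neighbors=k)`, assuming that num_neighbors is accepted as a keyword argument and that the returned labels follow the same cell order on every call.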

In [2]:

import gc

from monet import ExpMatrix
from monet import MonetModel
from monet import label_transfer
from monet import util

import pandas as pd

monet_model_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
ref_expression_file = 'data/v3_human_pbmc_10k_expression.npz'
target_expression_file = 'data/v2_human_pbmc_8k_expression.npz'

ref_cell_label_file = 'data/v3_human_pbmc_10k_clustering_annotated.tsv'

ref_cell_labels = util.load_cell_labels(ref_cell_label_file)
print(ref_cell_labels.value_counts())

monet_model = MonetModel.load_pickle(monet_model_file)

ref_matrix = ExpMatrix.load_npz(ref_expression_file)
target_matrix = ExpMatrix.load_npz(target_expression_file)

target_cell_labels = label_transfer.transfer_knn(
    monet_model, ref_matrix, ref_cell_labels, target_matrix)
print(target_cell_labels.value_counts())

# free up memory
del ref_matrix, target_matrix; gc.collect()

[2020-06-17 13:02:32] (monet.util.files) INFO: Loaded labels for 10681 cells from tab-delimited plain-text file.
Monocytes                     3345
CD4+ Memory T cells           1847
Naive T cells                 1361
Naive B cells                  956
Other                          888
CD8+/CD161+ Memory T cells     669
NK cells                       563
Memory B cells                 457
CD8+/CD161- Memory T cells     363
mDCs                           151
pDCs                            81
Name: 0, dtype: int64
[2020-06-17 13:02:32] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 13:02:35] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 13:02:38] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 13:02:39] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.00x (on average).
[2020-06-17 13:02:44] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 32.1 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 13:02:44] (monet.latent.pca_model) WARNING: No expression data for 1153 / 15510 genes (7.4 %) in the PCA model.
[2020-06-17 13:02:45] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.57x (on average).
[2020-06-17 13:02:49] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 20.8 % of the total variance in the scaled and FT-transformed data.
Naive T cells                 2116
Monocytes                     1670
CD4+ Memory T cells           1405
CD8+/CD161- Memory T cells     791
Naive B cells                  781
Other                          557
Memory B cells                 370
NK cells                       280
mDCs                           188
CD8+/CD161+ Memory T cells     161
pDCs                            62
dtype: int64

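Because the two datasets contain different numbers of cells, the raw counts printed above are easier to compare as fractions. The following sketch copies the counts from the output above into plain dictionaries (so it runs without the data files) and prints the composition of both datasets side by side. The fractions helper is a hypothetical name, not part of Monet.

```python
# cell counts copied from the output above
ref_counts = {
    'Monocytes': 3345, 'CD4+ Memory T cells': 1847, 'Naive T cells': 1361,
    'Naive B cells': 956, 'Other': 888, 'CD8+/CD161+ Memory T cells': 669,
    'NK cells': 563, 'Memory B cells': 457, 'CD8+/CD161- Memory T cells': 363,
    'mDCs': 151, 'pDCs': 81,
}
target_counts = {
    'Naive T cells': 2116, 'Monocytes': 1670, 'CD4+ Memory T cells': 1405,
    'CD8+/CD161- Memory T cells': 791, 'Naive B cells': 781, 'Other': 557,
    'Memory B cells': 370, 'NK cells': 280, 'mDCs': 188,
    'CD8+/CD161+ Memory T cells': 161, 'pDCs': 62,
}

def fractions(counts):
    """Convert absolute cell counts into fractions of the dataset total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

ref_frac = fractions(ref_counts)
target_frac = fractions(target_counts)

# print one row per cell type, ordered by abundance in the reference
print(f"{'cell type':<28}{'reference':>10}{'target':>10}")
for label in sorted(ref_counts, key=ref_counts.get, reverse=True):
    print(f"{label:<28}{ref_frac[label]:>10.1%}{target_frac[label]:>10.1%}")
```

The totals of the two dictionaries match the 10681 reference cells and 8381 target cells reported in the log messages. Differences in composition are not necessarily classification errors, since the two datasets come from separate experiments.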

Plot the results

In [3]:

import gc

from monet import ExpMatrix
from monet import visualize

expression_file = 'data/v2_human_pbmc_8k_expression.npz'

matrix = ExpMatrix.load_npz(expression_file)

cluster_order = [
    'Naive T cells',
    'CD4+ Memory T cells',
    'CD8+/CD161- Memory T cells',
    'CD8+/CD161+ Memory T cells',
    'NK cells',
    'Naive B cells',
    'Memory B cells',
    'Monocytes',
    'mDCs',
    'pDCs',
    'Other',
]
    
cluster_colors = {
    'Other': 'lightgray',
}

fig, tsne_scores = visualize.tsne_plot(
    matrix, num_components=30,
    cell_labels=target_cell_labels,
    cluster_order=cluster_order,
    cluster_colors=cluster_colors,
    width=1200)
fig.show()

print(target_cell_labels.value_counts())

# free up memory
del matrix; gc.collect()

[2020-06-17 13:02:53] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 13:02:53] (root) INFO: No Monet model provided, performing PCA to determine first 30principal components...
[2020-06-17 13:02:53] (monet.latent.pca_model) INFO: Converted matrix to float32 data type.
[2020-06-17 13:02:58] (monet.latent.pca_model) INFO: The PCA took 1.4 s.
[2020-06-17 13:02:58] (monet.latent.pca_model) INFO: The fraction of variance explained by the 30 selected PCs is 25.5 %.
[2020-06-17 13:02:58] (root) INFO: Performing t-SNE...
[2020-06-17 13:03:18] (root) INFO: t-SNE took 20.2 s.

Naive T cells                 2116
Monocytes                     1670
CD4+ Memory T cells           1405
CD8+/CD161- Memory T cells     791
Naive B cells                  781
Other                          557
Memory B cells                 370
NK cells                       280
mDCs                           188
CD8+/CD161+ Memory T cells     161
pDCs                            62
dtype: int64
