This tutorial shows how to work with CellRank using the low-level mode. We will interact directly with CellRank's two main modules, kernels and estimators. We assume that you have gone through the high level tutorial already.
The first part of this tutorial is very similar to scVelo's tutorial on pancreatic endocrinogenesis. This is essentially the same as in the high level tutorial, so feel free to skip the beginning and go directly to the section Run CellRank. The data we use here comes from Bastidas-Ponce et al. (2018). For more info on scVelo, see the documentation or read the article.
This tutorial notebook can be downloaded using the following link.
import scvelo as scv import scanpy as sc import cellrank as cr import numpy as np scv.settings.verbosity = 3 scv.settings.set_figure_params('scvelo') cr.settings.verbosity = 2
First, we need to get the data. The following commands will download the
adata object and save it under
adata = cr.datasets.pancreas() scv.utils.show_proportions(adata) adata
Abundance of ['spliced', 'unspliced']: [0.81 0.19]
AnnData object with n_obs × n_vars = 2531 × 27998 obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime' var: 'highly_variable_genes' uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca' obsm: 'X_pca', 'X_umap' layers: 'spliced', 'unspliced' obsp: 'connectivities', 'distances'
Filter out genes which don't have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scvelo functions, for more information about them, see the scVelo API.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000) sc.tl.pca(adata) sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30) scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Filtered out 22024 genes that are detected 20 counts (shared). Normalized count data: X, spliced, unspliced. Exctracted 2000 highly variable genes. Logarithmized X. computing moments based on connectivities finished (0:00:00) --> added 'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
We will use the dynamical model from scVelo to estimate the velocities. The first step, estimating the parameters of the dynamical model, may take a while (~10min). To make sure we only have to run this once, we developed a caching extension called scachepy. scachepy does not only work for
recover_dynamics, but it can cache the output of almost any scanpy or scvelo function. To install it, simply run
pip install git+https://github.com/theislab/scachepy
If you don't want to install scachepy now, don't worry, the below cell will run without it as well and this is the only place in this tutorial where we're using it.
try: import scachepy c = scachepy.Cache('../../cached_files/basic_tutorial/') c.tl.recover_dynamics(adata, force=False) except ModuleNotFoundError: print("You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). ") scv.tl.recover_dynamics(adata)
You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). recovering dynamics finished (0:12:00) --> added 'fit_pars', fitted parameters for splicing dynamics (adata.var)
Once we have the parameters, we can use these to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely two cells are to transition into another, given their velocity vectors and relative positions.
scv.tl.velocity(adata, mode='dynamical') scv.tl.velocity_graph(adata)
computing velocities finished (0:00:04) --> added 'velocity', velocity vectors for each individual cell (adata.layers) computing velocity graph finished (0:00:08) --> added 'velocity_graph', sparse matrix with cosine correlations (adata.uns)
scv.pl.velocity_embedding_stream(adata, basis='umap', legend_fontsize=12, title='', smooth=.8, min_mass=4)
computing velocity embedding finished (0:00:01) --> added 'velocity_umap', embedded velocity vectors (adata.obsm)
CellRank is a package for analyzing directed single cell data, whereby we mean single cell data that can be respresented via a directed graph. Most prominently, this is the case for single cell data where velocities have been computed - we can use these to direct the KNN graph. However, there are other situations in which we can inform the KNN graph of the direciton of the process, using i.e. pseudotime (see Palantir) or information obtained via mRNA labeling with e.g. scSLAM-seq, scEU-seq or sci-fate. Because we wanted CellRank to be widely applicable, no matter how directionality was introduced to the data, we split it up into two main modules,
estimators. In short,
kernels allow you to compute a (directed) transition matrix, whereas
estimators allow you to analyze it.
To construct a transition matrix, CellRank offers a number of kernel classes in
cellrank.tl.kernels. Currently implemented are the following:
VelocityKernel: compute transition matrix based on RNA velocity.
ConnectivityKernel: compute symmetric transition matrix based on transcriptomic similarity (essentially a DPT kernel).
PalantirKernel: mimics Palantir.
These kernels can be combined by simply using the
* operator, we will demonstrate this below. To find out more, check out the API. Note that the
kernel classes are designed to be easy to extend to incoporate future kernels based on e.g. mRNA labeling or other sources of directionality. Let's start with the
from cellrank.tl.kernels import VelocityKernel vk = VelocityKernel(adata)
To lern more about this object, we can print it:
There is not very much there yet. We can change this by computing the transition matrix:
Computing transition matrix based on logits using `'deterministic'` mode Estimating `softmax_scale` using `'deterministic'` mode
To see how exactly this transition matrix was computed, we can print the kernel again:
<Velo[softmax_scale=3.97, mode=deterministic, seed=44219, scheme=<CorrelationScheme>]>
There's a lot more info now! To find out what all of these mean, check the docstring of
.compute_transition_matrix. The most important bits of information here are
mode='deterministic: by default, the computation is deterministic, but we can also sample from the velocity distribution (
mode='sampling'), get a 2nd order estimate (
mode='stochastic') or a Monte Carlo estimate (
backward=False: run the process in the forward direction. To change this, set
backward=True when initializing the
softmax_scale: scaling factor used in the softmax to transform cosine similarities into probabilities. The larger this value, the more centered the distribution will be around the most likely cell. If
None, use velocity variances to scale the softmax, i.e. an automatic way to tune it in terms of local variance in velocities. This requires one additional run (always in 'deterministic' mode, to quickly estimate the scale).
The velocity kernel we computed above would allow us to reproduce the results from the high level tutorial. However, for the sake of demonstration, let's suppose that our velocities are very noisy and we want to make the analysis more robust by combining the velocity kernel with a connectivity kernel. This is very easy:
from cellrank.tl.kernels import ConnectivityKernel ck = ConnectivityKernel(adata).compute_transition_matrix()
Computing transition matrix based on connectivities Finish (0:00:00)
Note how it's possible to call the
.compute_transition_matrix method direcly when initializing the kernel - this works for all kernel classes. Given these two kernels now, we can combine them:
combined_kernel = 0.8 * vk + 0.2 * ck
Let's print the
combined_kernel to see what happened:
((0.8 * <Velo[softmax_scale=3.97, mode=deterministic, seed=44219, scheme=<CorrelationScheme>]>) + (0.2 * <Conn[dnorm=True]>))
There we go, we took the two computed transition matrices stored in the kernel object and combined them using a weighted mean, with weights given by the factors we provided. We will use the
combined_kernel in the
estimators section below.
Before moving on to the
estimators, let's demonstrate how to set up a
PalantirKernel. For this, we need a pseudotemporal ordering of the cells. Any pseudotime method can be used here. Note that this won't exactly reproduce the original palantir implementation becasue it uses a specific representation of the data and a specific pseudotime. We will simply use DPT here:
root_idx = np.where(adata.obs['clusters'] == 'Ngn3 low EP') adata.uns['iroot'] = root_idx sc.tl.dpt(adata)
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters.
Note that we did not use the above
VelocityKernel to infer the initial state here as we assume that in a situation where you want to apply Palantir, you probably don't have acess to velocities!
The last step is to initialize the
PalantirKernel based on the
adata object and the pre-computed diffusion pseudotime. If you want to use another pseudotime, use the
from cellrank.tl.kernels import PalantirKernel pk = PalantirKernel(adata, time_key='dpt_pseudotime').compute_transition_matrix() print(pk)
Computing transition matrix based on Palantir-like kernel Finish (0:00:03) <Pala[k=3, dnorm=True, n_neighs=30]>
Estimators take a
kernel object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow-time scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process, the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in
CFLARE: Clustering and Filtering Left And Right Eigenvectors. Heuristic method based on the spectrum of the transition matrix.
GPCCA: Generalized Perron Cluster Cluster Analysis: project the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, see Reuter et al. (2018).
For more information on the estimators, have a look at the API. We will demonstrate the
GPCCA estimator here, however, the
CFLARE estimator has a similar set of methods (which do different things internally). Let's start by initializing a
GPCCA object based on the
combined_kernel we constructed above:
from cellrank.tl.estimators import GPCCA g = GPCCA(combined_kernel) print(g)
WARNING: Computing transition matrix using the default parameters GPCCA[n=2531, kernel=((0.8 * <Velo[softmax_scale=3.97, mode=deterministic, seed=44219, scheme=<CorrelationScheme>]>) + (0.2 * <Conn[dnorm=True]>))]
Additionaly to the information about the kernel it is based on, this prints out the number of states in the underlying Markov chain. GPCCA needs a real sorted Schur decomposition to work with, so let's start by computing this and visualizing eigenvalues in complex plane:
Computing Schur decomposition WARNING: Using `20` components would split a block of complex conjugates. Increasing `n_components` to `21` When computing macrostates, choose a number of states NOT in `[5, 10, 16, 18, 20]` Adding `.eigendecomposition` `adata.uns['eig_fwd']` `.schur` `.schur_matrix` Finish (0:00:00)
To compute the Schur decomposition, there are two methods implemented
scipy.linalg.schurto compute a full real Schur decomposition and sort it using a python implementation of Brandts (2002). Note that
scipy.linalg.schuronly supports dense matrices, so consider using this for small cell numbers (<10k).
method='krylov': use an interative, krylov-subspace based algorightm provided in SLEPc to directly compute a partial, sorted, real Schur decomposition. This works with sparse matrices and will scale to extremly large cell numbers.
The real Schur decomposition for transition matrix
T is given by
Q U Q**(-1), where
Q is orthogonal and
U is quasi-upper triangular, which means it's upper triangular except for 2x2 blocks on the diagonal. 1x1 blocks on the diagonal represent real eigenvalues, 2x2 blocks on the diagonal represent complex eigenvalues. Above, we plotted the top 20 eigenvalues of the matrix
T to see whether there is an apparent eigengap. In the present case, there seems to be such a gap after the first 3 eigenvalues. We can visualize the corresponding Schur vectors in the embedding: