This tutorial shows how to work with CellRank using the **low-level mode**. We will interact directly with CellRank's two main modules, kernels and estimators. We assume that you have already gone through the high-level tutorial.

The first part of this tutorial closely follows scVelo's tutorial on pancreatic endocrinogenesis and is essentially the same as in the high-level tutorial, so feel free to skip the beginning and go directly to the section Run CellRank. The data we use here comes from Bastidas-Ponce et al. (2018). For more info on scVelo, see the documentation or read the article.

This tutorial notebook can be downloaded using the following link.

The easiest way to start is to download Miniconda3 along with the environment file found here. To create the environment, run `conda env create -f environment.yml`.

In [1]:

```
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo')
cr.settings.verbosity = 2
```

First, we need to get the data. The following commands will download the `adata` object and save it under `datasets/endocrinogenesis_day15.5.h5ad`.

In [2]:

```
adata = cr.datasets.pancreas()
scv.utils.show_proportions(adata)
adata
```

Filter out genes which don't have enough spliced/unspliced counts, normalize and log-transform the data, and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scVelo functions; for more information about them, see the scVelo API.

In [3]:

```
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
```

We will use the dynamical model from scVelo to estimate the velocities. The first step, estimating the parameters of the dynamical model, may take a while (~10 min). To make sure we only have to run this once, we developed a caching extension called scachepy. scachepy works not only for `recover_dynamics`; it can cache the output of almost any scanpy or scVelo function. To install it, simply run

`pip install git+https://github.com/theislab/scachepy`

If you don't want to install scachepy now, don't worry: the cell below runs without it as well, and this is the only place in this tutorial where we use it.

In [4]:

```
try:
    import scachepy
    c = scachepy.Cache('../../cached_files/basic_tutorial/')
    c.tl.recover_dynamics(adata, force=False)
except ModuleNotFoundError:
    print("You don't seem to have scachepy installed, but that's fine, you just have to be a bit patient (~10min). ")
    scv.tl.recover_dynamics(adata)
```

Once we have the parameters, we can use them to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely two cells are to transition into one another, given their velocity vectors and relative positions.

In [5]:

```
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```
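To build some intuition for what such a graph encodes, here is a minimal NumPy sketch for a single cell. It is a toy illustration under our own assumptions (made-up 2D positions and velocity, and a hypothetical helper `cosine_to_neighbors`), not scVelo's actual implementation. It scores each neighbor by the cosine similarity between the cell's velocity vector and the displacement toward that neighbor.

```python
import numpy as np


def cosine_to_neighbors(position, velocity, neighbor_positions):
    """Cosine similarity between a velocity vector and the displacement to each neighbor."""
    # Vectors pointing from the cell to each of its neighbors.
    displacements = neighbor_positions - position
    # Product of the vector lengths, one entry per neighbor.
    norms = np.linalg.norm(displacements, axis=1) * np.linalg.norm(velocity)
    return displacements @ velocity / norms


# Toy data (assumed): one cell at the origin moving along the x-axis.
position = np.array([0.0, 0.0])
velocity = np.array([1.0, 0.0])
neighbors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

cosines = cosine_to_neighbors(position, velocity, neighbors)
print(cosines)
```

The neighbor lying in the direction of the velocity gets the highest score, the orthogonal one a score of zero, and the one behind the cell a negative score.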

In [6]:

```
scv.pl.velocity_embedding_stream(adata, basis='umap', legend_fontsize=12, title='', smooth=.8, min_mass=4)
```

CellRank is a package for analyzing directed single cell data, by which we mean single cell data that can be represented via a directed graph. Most prominently, this is the case for single cell data where velocities have been computed, since we can use these to direct the KNN graph. However, there are other situations in which we can inform the KNN graph of the direction of the process, using e.g. pseudotime (see Palantir) or information obtained via mRNA labeling with e.g. scSLAM-seq, scEU-seq or sci-fate. Because we wanted CellRank to be widely applicable, no matter how directionality was introduced to the data, we split it up into two main modules, `kernels` and `estimators`. In short, `kernels` allow you to compute a (directed) transition matrix, whereas `estimators` allow you to analyze it.

To construct a transition matrix, CellRank offers a number of kernel classes in `cellrank.tl.kernels`. Currently implemented are the following:

- `VelocityKernel`: compute transition matrix based on RNA velocity.
- `ConnectivityKernel`: compute symmetric transition matrix based on transcriptomic similarity (essentially a DPT kernel).
- `PalantirKernel`: mimics Palantir.

These kernels can be combined by simply using the `+` or `*` operator; we will demonstrate this below. To find out more, check out the API. Note that the `kernel` classes are designed to be easy to extend to incorporate future kernels based on e.g. mRNA labeling or other sources of directionality. Let's start with the `VelocityKernel`:

In [7]:

```
from cellrank.tl.kernels import VelocityKernel
vk = VelocityKernel(adata)
```

To learn more about this object, we can print it:

In [8]:

```
print(vk)
```

There is not very much there yet. We can change this by computing the transition matrix:

In [9]:

```
vk.compute_transition_matrix()
```

To see how exactly this transition matrix was computed, we can print the kernel again:

In [10]:

```
print(vk)
```

There's a lot more info now! To find out what all of these mean, check the docstring of `.compute_transition_matrix`. The most important bits of information here are:

- `mode='deterministic'`: by default, the computation is deterministic, but we can also sample from the velocity distribution (`mode='sampling'`), get a 2nd order estimate (`mode='stochastic'`) or a Monte Carlo estimate (`mode='monte_carlo'`).
- `backward=False`: run the process in the forward direction. To change this, set `backward=True` when initializing the `VelocityKernel`.
- `softmax_scale`: scaling factor used in the softmax to transform cosine similarities into probabilities. The larger this value, the more concentrated the distribution will be around the most likely cell. If `None`, use velocity variances to scale the softmax, i.e. an automatic way to tune it in terms of local variance in velocities. This requires one additional run (always in 'deterministic' mode, to quickly estimate the scale).
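The effect of `softmax_scale` can be illustrated with a few lines of NumPy. This is only a sketch under our own assumptions: the cosine similarities are made up and the helper `softmax_probabilities` is hypothetical, so it shows the general softmax idea rather than CellRank's internal code.

```python
import numpy as np


def softmax_probabilities(cosines, softmax_scale):
    """Turn cosine similarities into transition probabilities with a scaled softmax."""
    logits = softmax_scale * np.asarray(cosines, dtype=float)
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Assumed cosine similarities between one cell's velocity and three neighbors.
cosines = [0.9, 0.5, -0.2]

p_flat = softmax_probabilities(cosines, softmax_scale=1.0)
p_peaked = softmax_probabilities(cosines, softmax_scale=10.0)

print(p_flat)
print(p_peaked)
```

Both vectors sum to one and favor the first neighbor, but with the larger scale almost all probability mass ends up on that neighbor.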

The velocity kernel we computed above would allow us to reproduce the results from the high-level tutorial. However, for the sake of demonstration, let's suppose that our velocities are very noisy and we want to make the analysis more robust by combining the velocity kernel with a connectivity kernel. This is very easy:

In [11]:

```
from cellrank.tl.kernels import ConnectivityKernel
ck = ConnectivityKernel(adata).compute_transition_matrix()
```

Note how it's possible to call the `.compute_transition_matrix` method directly when initializing the kernel; this works for all kernel classes. Given these two kernels, we can now combine them:

In [12]:

```
combined_kernel = 0.8 * vk + 0.2 * ck
```

Let's print the `combined_kernel` to see what happened:

In [13]:

```
print(combined_kernel)
```

There we go: we took the two computed transition matrices stored in the kernel objects and combined them using a weighted mean, with weights given by the factors we provided. We will use the `combined_kernel` in the `estimators` section below.
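To make the weighted mean concrete, here is a small NumPy sketch on toy matrices. The two 3x3 matrices and the helper `row_normalize` are assumptions made for illustration and are not taken from the kernel objects above. The point is only that a weighted mean of two row-stochastic matrices, with weights summing to one, is again a valid transition matrix.

```python
import numpy as np


def row_normalize(matrix):
    """Scale each row so that it sums to one."""
    return matrix / matrix.sum(axis=1, keepdims=True)


# Assumed directed (velocity-like) transition matrix; each row sums to one.
T_velocity = np.array([[0.1, 0.9, 0.0],
                       [0.0, 0.2, 0.8],
                       [0.0, 0.0, 1.0]])

# Assumed symmetric similarity graph, turned into a transition matrix.
similarity = np.array([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0]])
T_connectivity = row_normalize(similarity)

# Weighted mean with the same weights as in the tutorial.
T_combined = 0.8 * T_velocity + 0.2 * T_connectivity

print(T_combined)
print(T_combined.sum(axis=1))
```

Entries that are zero in the velocity-based matrix but non-zero in the similarity-based one become small positive probabilities, which is how the connectivity term smooths noisy velocities.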

Before moving on to the `estimators`, let's demonstrate how to set up a `PalantirKernel`. For this, we need a pseudotemporal ordering of the cells. Any pseudotime method can be used here. Note that this won't exactly reproduce the original Palantir implementation because Palantir uses a specific representation of the data and a specific pseudotime. We will simply use DPT here:

In [14]:

```
root_idx = np.where(adata.obs['clusters'] == 'Ngn3 low EP')[0][0]
adata.uns['iroot'] = root_idx
sc.tl.dpt(adata)
```

Note that we did not use the above `VelocityKernel` to infer the initial state here, as we assume that in a situation where you want to apply Palantir, you probably don't have access to velocities!

The last step is to initialize the `PalantirKernel` based on the `adata` object and the pre-computed diffusion pseudotime. If you want to use another pseudotime, use the `time_key` keyword.

In [15]:

```
from cellrank.tl.kernels import PalantirKernel
pk = PalantirKernel(adata, time_key='dpt_pseudotime').compute_transition_matrix()
print(pk)
```
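The general idea of using a pseudotime to direct a KNN graph can be sketched in NumPy as follows. This is a deliberately simplified illustration and not the `PalantirKernel` implementation. We assume a toy 3-cell graph with self-connections, a hypothetical helper `direct_by_pseudotime`, and the simple rule of dropping every edge that points to a cell earlier in pseudotime.

```python
import numpy as np


def direct_by_pseudotime(connectivities, pseudotime):
    """Drop edges pointing backward in pseudotime, then row-normalize."""
    # keep[i, j] is True when cell j is not earlier in pseudotime than cell i.
    keep = pseudotime[None, :] >= pseudotime[:, None]
    directed = np.where(keep, connectivities, 0.0)
    # Self-connections (assumed present) guarantee that no row is all zeros.
    return directed / directed.sum(axis=1, keepdims=True)


# Assumed symmetric KNN graph over three cells along a chain 0 - 1 - 2.
connectivities = np.array([[1.0, 1.0, 0.0],
                           [1.0, 1.0, 1.0],
                           [0.0, 1.0, 1.0]])
pseudotime = np.array([0.0, 0.5, 1.0])

T_directed = direct_by_pseudotime(connectivities, pseudotime)
print(T_directed)
```

The resulting matrix is no longer symmetric: transitions only run from earlier to later cells, and the last cell in pseudotime only transitions to itself.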

Estimators take a `kernel` object and offer methods to analyze it. The main objective is to decompose the state space into a set of macrostates that represent the slow time-scale dynamics of the process. A subset of these macrostates will be the initial or terminal states of the process; the remaining states will be intermediate transient states. CellRank currently offers two estimator classes in `cellrank.tl.estimators`:

- `CFLARE`: **C**lustering and **F**iltering **L**eft **A**nd **R**ight **E**igenvectors. Heuristic method based on the spectrum of the transition matrix.
- `GPCCA`: **G**eneralized **P**erron **C**luster **C**luster **A**nalysis. Projects the Markov chain onto a small set of macrostates using a Galerkin projection which maximizes the self-transition probability for the macrostates, see Reuter et al. (2018).

For more information on the estimators, have a look at the API. We will demonstrate the `GPCCA` estimator here; however, the `CFLARE` estimator has a similar set of methods (which do different things internally). Let's start by initializing a `GPCCA` object based on the `combined_kernel` we constructed above:

In [16]:

```
from cellrank.tl.estimators import GPCCA
g = GPCCA(combined_kernel)
print(g)
```

In addition to the information about the kernel it is based on, this prints out the number of states in the underlying Markov chain. GPCCA needs a real sorted Schur decomposition to work with, so let's start by computing this and visualizing the eigenvalues in the complex plane:

In [17]:

```
g.compute_schur(n_components=20)
g.plot_spectrum()
```

To compute the Schur decomposition, there are two methods implemented:

- `method='brandts'`: use `scipy.linalg.schur` to compute a full real Schur decomposition and sort it using a Python implementation of Brandts (2002). Note that `scipy.linalg.schur` only supports dense matrices, so consider using this for small cell numbers (<10k).
- `method='krylov'`: use an iterative, Krylov-subspace based algorithm provided in SLEPc to directly compute a partial, sorted, real Schur decomposition. This works with sparse matrices and will scale to extremely large cell numbers.

The real Schur decomposition for transition matrix `T` is given by `Q U Q**(-1)`, where `Q` is orthogonal and `U` is quasi-upper triangular, which means it's upper triangular except for 2x2 blocks on the diagonal. 1x1 blocks on the diagonal represent real eigenvalues, whereas 2x2 blocks on the diagonal represent pairs of complex conjugate eigenvalues. Above, we plotted the top 20 eigenvalues of the matrix `T` to see whether there is an apparent *eigengap*. In the present case, there seems to be such a gap after the first 3 eigenvalues. We can visualize the corresponding Schur vectors in the embedding:

In [18]:

```
g.plot_schur(use=3)
```
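The properties of the real Schur decomposition and the eigengap heuristic described above can be checked on a small example. The sketch below is independent of CellRank. It assumes a toy 6-state transition matrix made of three weakly coupled 2-state blocks and uses `scipy.linalg.schur` without any sorting, so it only illustrates the unsorted decomposition, not the sorted one GPCCA works with.

```python
import numpy as np
from scipy.linalg import schur

# Assumed toy chain: three 2-state blocks, weakly coupled by uniform jumps.
block = np.array([[0.6, 0.4],
                  [0.5, 0.5]])
block_diag = np.kron(np.eye(3), block)
T = 0.95 * block_diag + 0.05 * np.full((6, 6), 1.0 / 6.0)

# Real Schur decomposition: T = Q U Q^T with Q orthogonal, so Q^(-1) = Q^T.
U, Q = schur(T, output="real")
print(np.allclose(Q @ U @ Q.T, T))

# Eigengap heuristic: sort eigenvalues by real part and find the largest drop.
eigenvalues = np.sort(np.linalg.eigvals(T).real)[::-1]
gaps = eigenvalues[:-1] - eigenvalues[1:]
n_states = int(np.argmax(gaps)) + 1
print(eigenvalues)
print(n_states)
```

In this toy chain the three eigenvalues close to one correspond to the three blocks, so the largest gap appears after the third eigenvalue. This mirrors the reasoning used above to choose the number of Schur vectors.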