First compiled: June 19, 2017. Updated June 8, 2018.
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as pl
import scanpy.api as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, color_map='viridis') # low dpi (dots per inch) yields small inline figures
sc.logging.print_versions()
results_file = './write/krumsiek11_blobs.h5ad'
scanpy==1.2.0 anndata==0.6.4+6.gd9727ca numpy==1.13.1 scipy==1.0.0 pandas==0.22.0 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
The simulated data that we'll use here describes development from a progenitor cell to four differentiated cell types: monocyte, erythrocyte, megakaryocyte and neutrophil, and only uses the expression levels of a few curated marker genes. See Krumsiek et al. (2009) for the definition of the boolean literature-curated network underlying the simulation and discussion of marker genes. The data has been generated using sc.tl.sim('krumsiek11')
; see here for more details. We know the ground truth, which is hard to establish for real data.
To add disconnected structure to the continuous differentiation manifold, which is almost always present in real data, we add a "noise" clusters from a Gaussian mixture model. In real data this might be due to imperfect sampling or just a very heterogeneous tissue. The clusters are disconnected in the sense that they are not part of the differentiation trajectories. One of the strengths of PAGA is that it is able to identify them as not being part of the ground truth trajectory and leave them out of the proposed structure.
We begin by importing our data, and then add the extra "noise" clusters to it (hence the name krumsiek11_blobs
that will show up from now on). When working on your own data, prepare the count matrix with genes as the columns and cells as the rows, with their names in the first row/column respectively, and import them using sc.read()
. You can also call adata.transpose()
if your files follow a different convention. The parser is robust, and can handle files with commas, tabs or spaces as delimiters. After importing the count matrix, you can add cell annotations to the resulting structure, the information for which you can prepare in an external file with the first column as cell names. You can take a look at nestorowa16 for exact syntax and a download of some appropriately formatted input files. Scanpy can also directly use Cell Ranger count matrices on input, see the Seurat-like tutorial for details.
The actual data provided on input can be either raw counts or log-transformed counts. The count matrix will require some basic preprocessing prior to commencing the analysis, with raw count data processed, for instance, with a simple call of sc.pp.recipe_zheng17()
. See the Seurat tutorial for many more options preprocessing.
adata_krumsiek11 = sc.datasets.krumsiek11()
adata_krumsiek11.obs_names_make_unique()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`. ... storing 'cell_type' as categorical
adata_blobs = sc.datasets.blobs(cluster_std=0.5, n_centers=2)
adata_blobs.var_names = adata_krumsiek11.var_names
adata = adata_krumsiek11.concatenate(adata_blobs, index_unique='-')
adata.uns = adata_krumsiek11.uns
adata.write(results_file)
/Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 0 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 159 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 319 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 459 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 619 transformed to str upon writing to h5,using string keys is recommended .format(k))
Since we know the underlying ground truth, we can visualize the simulated data as heatmaps. The darker the heatmap the higher the expression. This is not immediately applicable to experimental data you are likely to have on input, but a heatmap of the trajectories we infer created later on in the notebook may be of more relevance.
_, axs = pl.subplots(ncols=4, figsize=(10, 4), gridspec_kw={'wspace': 0.05, 'left': 0.1})
pl.subplots_adjust(left=0.05, right=0.98, top=0.88)
for ititle, (title, idcs) in enumerate(
[('erythrocyte', (160, 320)), ('neutrophil', (480, 640)),
('monocyte', (0, 160)), ('megakaryocyte', (320, 480)), ]):
pl.sca(axs[ititle])
img = pl.imshow(adata.X[idcs[0]:idcs[1]].T, aspect='auto', interpolation='nearest', cmap='Greys')
if ititle == 0: pl.yticks(range(adata.n_vars), adata.var_names)
else: pl.yticks([])
pl.title('path to\n{} fate'.format(title))
pl.xticks([])
pl.grid(False)
pl.xlabel('time')
axs[ititle].set_frame_on(False)
pl.savefig('./figures/krumsiek11_timeseries_heatmap.pdf')
pl.show()
Now that we have our data, we can visualise it. PAGA's preferred visualisation is the Fruchterman & Reingold algorithm (obtained via sc.tl.draw_graph()
below), but Scanpy has many other options for visualizing the data. We precompute the visualisations to begin with, and then add this extra information to our constantly expanding .h5ad
result archive (that's the sc.read()
and sc.write()
calls bookending the code blocks - storing our analysis results for easy later access every step of the way).
sc.pp.neighbors(adata, n_neighbors=30) # generate a neighborhood graph of single cells
sc.tl.draw_graph(adata) # draw this graph using standard drawing algorithms
computing neighbors using data matrix X directly finished (0:00:02.53) --> added to `.uns['neighbors']` 'distances', weighted adjacency matrix 'connectivities', weighted adjacency matrix drawing single-cell graph using layout "fa" finished (0:00:05.39) --> added 'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
Now that the visualisations are precomputed, we can produce and save the corresponding plots. It should be noted that PAGA saves its output files in predetermined default locations, you can append to the standard naming schemes if you change save=True
to whatever you want to append to the name. So, for example, if you wanted the tSNE plot to be figures/tsne1.png
instead of the default figures/tsne.png
, you'd call the tSNE plotter function with save='1'
. You can change the default directory by changing sc.settings.figdir
to any string that should be prepended to your path. You can also save the figures by right-clicking in your firefox or other browser and selecting "Save Image As".
sc.pl.draw_graph(adata)
... storing 'blobs' as categorical ... storing 'cell_type' as categorical
adata.write(results_file)
/Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 0 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 159 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 319 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 459 transformed to str upon writing to h5,using string keys is recommended .format(k)) /Users/alexwolf/_hholtz/01_projects/1512_scanpy/anndata/anndata/readwrite/write.py:112: UserWarning: dict key 619 transformed to str upon writing to h5,using string keys is recommended .format(k))
PAGA's follows proceeds in two steps: cluster the single-cell graph and then construct the abstracted graph based on the connectivity between the clusters. The resolution parameter controls how coarse the clustering is - the higher the value, the more clusters you'll get and the smaller they'll be. It's not a bad idea to run the following few bits of code with a couple of different resolutions, trying to find a visually convincing parititon.
adata = sc.read(results_file)
sc.tl.louvain(adata, resolution=2)
running Louvain clustering using the "louvain" package of Traag (2017) finished (0:00:00.17) --> found 10 clusters and added 'louvain', the cluster labels (adata.obs, categorical)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data')
Let us better resolve the first branching region around cluster 3.
sc.tl.louvain(adata, resolution=1, restrict_to=('louvain', ['4']))
running Louvain clustering using the "louvain" package of Traag (2017) finished (0:00:00.01) --> found 3 clusters and added 'louvain_R', the cluster labels (adata.obs, categorical)
Use the reclustered louvain groups to initialize our clustering, give them simple names.
adata.obs['clusters'] = adata.obs['louvain_R']
adata.rename_categories('clusters', [str(i) for i in range(len(adata.obs['clusters'].cat.categories))])
sc.pl.draw_graph(adata, color='clusters', legend_loc='on data')
Annotate the clusters.
adata.obs['clusters'].cat.categories
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11'], dtype='object')
new_categories = ['0', '1', '2', '3/Neu', '4', '5/Stem', '6', '7/Mo', '8/Ery', '9/Mk', '10', '11']
adata.rename_categories('clusters', new_categories)
del adata.uns['highlights'] # remove the annotation of the single datapoints
sc.pl.draw_graph(adata, color='clusters', legend_loc='on data')
sc.tl.paga(adata, groups='clusters')
running partition-based graph abstraction (PAGA) finished (0:00:00.20) --> added 'paga/connectivities', connectivities adjacency (adata.uns) 'paga/connectivities_tree', connectivities subtree (adata.uns)
There are a number of different ways to visualise the resulting abstracted graph, with the most power going to the layout parameter of the function call. The default is the Fruchterman & Reingold algorithm PAGA in (layout='fr'
), with the most common alternative being Reingold-Tilford (layout='rt'
), which creates a hierarchical structure based on where you specify the tree's roots to be.
In the present case, the result is tree-like, so we use Reingold-Tilford. Seeing how clustering results may vary slightly between machines, ensure that the root
parameter in the call below is pointed at the cluster that has the progenitor cells in it, plus two clusters representative of the noise blobs! If you don't provide a root parameter value but ask for the 'rt'
layout, Scanpy will make an educated guess for you based on the graph topology.
sc.pl.paga(adata)
--> added 'pos', the PAGA positions (adata.uns['paga'])
sc.pl.paga(adata, layout='rt', root=[5, 0, 1])
--> added 'pos', the PAGA positions (adata.uns['paga'])
We can use this layout to generate a single-cell embedding that reflects the same global topology.
sc.tl.draw_graph(adata, init_pos='paga', layout='fr', maxiter=50)
drawing single-cell graph using layout "fr" finished (0:00:01.50) --> added 'X_draw_graph_fr', graph_drawing coordinates (adata.obsm)
X = adata.obsm['X_draw_graph_fr'].copy()
adata.obsm['X_draw_graph_fr'] = X.copy()
# adata.obsm['X_draw_graph_fr'][adata.obs['clusters'] == '0', 1] -= 400
adata.obsm['X_draw_graph_fr'][adata.obs['clusters'] == '0', 0] += 500
# adata.obsm['X_draw_graph_fr'][adata.obs['clusters'] == '1', 1] -= 1000
adata.obsm['X_draw_graph_fr'][adata.obs['clusters'] == '1', 0] -= 500
sc.pl.draw_graph(adata, color='clusters', edges=True, size=20, legend_loc='on data')
Harmonize the colors.
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()
orig_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = orig_colors.copy()
new_colors[[4, 6]] = orig_colors[[12, 13]] # Stem colors / green
new_colors[[8, 10]] = orig_colors[[5, 10]] # Ery colors / red
new_colors[[9]] = orig_colors[[17]] # Mk early Ery colors / yellow
new_colors[[2, 3]] = orig_colors[[2, 25]] # lymph progenitors / grey
new_colors[[3, 11]] = orig_colors[[6, 7]] # Neu / light blue
new_colors[[7, 12]] = orig_colors[[0, 1]] # Mo / dark blue
adata.uns['clusters_colors'] = new_colors
adata.obs['clusters'].cat.categories = [
'0', '1', '2', '3/Neu\n\n', '4', ' 5/Stem\n\n', '6', '7/Mo\n\n', '8/Ery\n\n', ' 9/Mk\n\n',
'10', '11']
axs = sc.pl.paga_compare(
adata, title='', right_margin=0.2, size=20, edge_width_scale=0.3,
legend_fontsize=12, fontsize=12, frameon=False, show=False)
--> added 'pos', the PAGA positions (adata.uns['paga'])
pl.sca(axs[0])
pl.xlabel('')
pl.ylabel('')
axs[0].set_frame_on(False)
pl.savefig('./figures/paga_compare_simulated.pdf')
pl.show()
adata.write(results_file)
You can also plot the distance along the manifold (pseudotime) - an extension of DPT Haghverdi et al. (2016). To do this, you need to ensure that your object has information on the position of the root stored within it before calling the Louvain clustering, as previously mentioned. Otherwise, the clustering won't know where to anchor the pseudotime and will make no attempt to compute it. You don't necessarily have to pick this value out by hand though, for example nestorowa16 uses the cell type annotation that was imported along with the data to set the root to the index of the first cell tagged as Stem:
adata.uns['iroot'] = np.flatnonzero(adata.obs['cell_types'] == 'Stem')[0]
adata = sc.read(results_file)
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color='dpt_pseudotime')
WARNING: Trying to run `tl.dpt` without prior call of `tl.diffmap`. Falling back to `tl.diffmap` with default parameters. computing Diffusion Maps using n_comps=15(=n_dcs) eigenvalues of transition matrix [ 1. 1. 0.9994561076 0.9990157485 0.9974681735 0.9971336126 0.9936080575 0.9897011518 0.9837759137 0.9828007817 0.9784398079 0.9745985866 0.9677239656 0.9599461555 0.9582870007] finished (0:00:00.07) --> added 'X_diffmap', diffmap coordinates (adata.obsm) 'diffmap_evals', eigenvalues of transition matrix (adata.uns) computing Diffusion Pseudotime using n_dcs=10 finished (0:00:00.00) --> added 'dpt_pseudotime', the pseudotime (adata.obs)
Now that we've inferred the abstracted graph and a pseudotime, we can visualise the gene expression changes along proposed paths. Notice how when specifying the paths
, we start at the progenitor root (in the case of the Louvain clustering shown in the notebook, cluster 5) and follow the tree down until we reach the cluster tagged with the corresponding cell type in the Fruchterman-Reingold visualisation. If you wished to use this on your own experimental data, you'd probably need to identify some key marker genes using either prior knowledge or sc.tl.rank_gene_groups()
first - don't forget, this is simulated data with only the handful of known crucial genes present, you'd need to reduce the gene space of a typical single cell dataset to make the heatmap work. If you end up following the sc.tl.rank_gene_groups()
path, the Louvain clustering can be accessed by passing rank_by='PAGA_groups'
into the function call. Adding a dedicated method for identifying markers along trajectories in the abstracted graph is a forthcoming functionality expansion, with a number of potential approaches being evaluated.
adata.obs['distance'] = adata.obs['dpt_pseudotime']
paths = [('erythrocytes', [5, 6, 4, 2, 10, 8]),
('neutrophils', [5, 6, 11, 3]),
('monocytes', [5, 6, 4, 2, 7])]
adata.var_names
Index(['Gata2', 'Gata1', 'Fog1', 'EKLF', 'Fli1', 'SCL', 'Cebpa', 'Pu.1', 'cJun', 'EgrNab', 'Gfi1'], dtype='object', name='index')
Simply reorder the gene names to match those of the other datasets.
gene_names = ['Gata2', 'Gata1', 'SCL', 'Fog1', 'EKLF', 'Fli1', 'Cebpa', 'Gfi1', 'Pu.1', 'cJun', 'EgrNab']
_, axs = pl.subplots(ncols=3, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
_, data = sc.pl.paga_path(
adata, path, gene_names,
show_node_names=False,
ax=axs[ipath],
ytick_fontsize=12,
left_margin=0.15,
n_avg=50,
annotations=['distance'],
show_yticks=True if ipath==0 else False,
show_colorbar=False,
color_map='Greys',
color_maps_annotations={'distance': 'viridis'},
title='{} path'.format(descr),
return_data=True,
show=False)
data.to_csv('./write/paga_path_{}.csv'.format(descr))
pl.savefig('./figures/paga_path_simulated.pdf')
pl.show()