Clustering 3k PBMCs following a Seurat Tutorial

This started out (May 2017) as a demonstration that Scanpy could reproduce most of Seurat's (Satija et al., 2015) clustering tutorial (link), which we gratefully acknowledge. In the meantime, we have added and removed several pieces.

The data consists of 3k PBMCs from a healthy donor and is freely available from 10x Genomics (here from this webpage).

In [1]:
import numpy as np
import pandas as pd
import scanpy.api as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
results_file = './write/pbmc3k.h5ad'
scanpy==1.3.5 anndata==0.6.13+17.ga8e2ae2 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.1 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 
In [2]:
sc.settings.set_figure_params(dpi=80)
In [3]:
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', var_names='gene_symbols', cache=True)
... reading from cache file ./cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad
In [4]:
adata.var_names_make_unique()
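
As an aside, recent Scanpy versions also ship a convenience loader that downloads and caches the same filtered matrix, in case you prefer not to fetch the archive from the 10x website manually (a sketch; availability depends on your Scanpy version):

adata = sc.datasets.pbmc3k()  # downloads the filtered 3k PBMC matrix and returns an AnnData object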

Preprocessing

Note: In Jupyter notebooks and JupyterLab, you can see the documentation for a Python function by hitting SHIFT + TAB. Hit it twice to expand the view.

Show those genes that yield the highest fraction of counts in each single cell, across all cells.

In [5]:
sc.pl.highest_expr_genes(adata, n_top=20)

Basic filtering.

In [6]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 19024 genes that are detected in less than 3 cells

Plot some information about mitochondrial genes, which is important for quality control. Note that you can also retrieve mitochondrial genes via BioMart with sc.queries.mitochondrial_genes('www.ensembl.org', 'hsapiens') (this dataset is human; pass 'mmusculus' for mouse data).

Citing from "Simple Single Cell" workflows (Lun, McCarthy & Marioni, 2017):

High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

Note that you can also use the function pp.calculate_qc_metrics to compute this and more.
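
For instance, a minimal sketch of that alternative (the column name 'mt' is an arbitrary choice made here; note that pct_counts_mt is a percentage, unlike the fraction computed below):

adata.var['mt'] = adata.var_names.str.startswith('MT-')  # mark mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, inplace=True)
# adds adata.obs['n_genes_by_counts'], adata.obs['total_counts'] and adata.obs['pct_counts_mt']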

In [7]:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

A violin plot of the computed quality measures.

In [8]:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
             jitter=0.4, multi_panel=True)

Remove cells that have too many mitochondrial genes expressed or too many total counts. First, inspect these quality measures in scatter plots.

In [9]:
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
In [10]:
adata
Out[10]:
AnnData object with n_obs × n_vars = 2700 × 13714 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells'

Actually do the filtering.

In [11]:
adata = adata[adata.obs['n_genes'] < 2500, :]
adata = adata[adata.obs['percent_mito'] < 0.05, :]

Per-cell normalize (library-size correct) the data matrix $\mathbf{X}$, so that counts become comparable among cells.
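
Concretely, every entry $x_{ij}$ of $\mathbf{X}$ is rescaled by the total count of its cell, so that each cell ends up with $10^4$ counts:

$$\hat{x}_{ij} = 10^4 \cdot \frac{x_{ij}}{\sum_{j'} x_{ij'}}$$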

In [12]:
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

Logarithmize the data.

In [13]:
sc.pp.log1p(adata)

Set the .raw attribute of the AnnData object to the normalized and logarithmized gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object. While many people consider the normalized data matrix the "relevant data" for visualization and differential testing, some would prefer to store the unnormalized data.

In [14]:
adata.raw = adata
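
The frozen state remains accessible later via adata.raw, even after we subset adata to the highly-variable genes below. A quick check (an aside, not part of the original tutorial):

print(adata.raw.X.shape)        # still spans all genes kept after basic filtering
print(adata.raw.var_names[:5])  # gene names of the frozen state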

Identify highly-variable genes.

In [15]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
In [16]:
sc.pl.highly_variable_genes(adata)
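
To see how many genes were flagged before subsetting (an aside):

print(adata.var['highly_variable'].sum())  # number of genes flagged as highly variable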

Actually do the filtering.

In [17]:
adata = adata[:, adata.var['highly_variable']]

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.

In [18]:
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
regressing out ['n_counts', 'percent_mito']
    sparse input is densified and may lead to high memory use
    finished (0:00:06.73)

Scale each gene to unit variance. Clip values exceeding standard deviation 10.

In [19]:
sc.pp.scale(adata, max_value=10)

Save the result.

In [20]:
adata.write(results_file)

PCA

Compute PCA and make a scatter plot.

In [21]:
sc.tl.pca(adata, svd_solver='arpack')
In [22]:
sc.pl.pca(adata, color='CST3')

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, which are used, for example, in the clustering function sc.tl.louvain() or in tSNE sc.tl.tsne(). In our experience, a rough estimate of the number of PCs often does fine. Seurat provides many more functions for this, here.

In [23]:
sc.pl.pca_variance_ratio(adata, log=True)
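
The underlying numbers are stored by sc.tl.pca and can also be inspected directly, e.g. to look at the cumulative variance explained (a minimal sketch):

variance_ratio = adata.uns['pca']['variance_ratio']
print(np.cumsum(variance_ratio)[:10])  # cumulative fraction of variance for the first 10 PCs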
In [24]:
adata.write(results_file)
In [25]:
adata
Out[25]:
AnnData object with n_obs × n_vars = 2638 × 1838 
    obs: 'n_genes', 'percent_mito', 'n_counts'
    var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing the neighborhood graph

In [26]:
adata = sc.read(results_file)

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You could simply use default values here; for the sake of reproducing Seurat's results, we take the following values.

In [27]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished (0:00:04.61) --> added to `.uns['neighbors']`
    'distances', weighted adjacency matrix
    'connectivities', weighted adjacency matrix

We now advocate visualizing the data using UMAP (see below). In particular, if you have large data, UMAP will give you a notable speedup over tSNE. It is also potentially more faithful to the global topology of the data: trajectories are better preserved.

In [28]:
sc.tl.umap(adata)
computing UMAP
    finished (0:00:07.29) --> added
    'X_umap', UMAP coordinates (adata.obsm)
In [29]:
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
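
For comparison, you can also compute a tSNE embedding on the same PCA representation (a sketch; n_pcs=40 mirrors the neighbors call above):

sc.tl.tsne(adata, n_pcs=40)
sc.pl.tsne(adata, color=['CST3', 'NKG7', 'PPBP'])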