Simulate some data via Splatter to test BBKNN. Let's go for three batches of three cell types, and port the counts and cell type/batch metadata back to python.
%load_ext rpy2.ipython
%%R -o counts -o meta
suppressMessages(library(splatter))
params <- newSplatParams()
params <- setParam(params, "nGenes", 5000)
params <- setParam(params, "batchCells", c(500,500,500))
params <- setParam(params, "batch.facLoc", 0.5)
params <- setParam(params, "batch.facScale", 0.5)
params <- setParam(params, "group.prob", c(1/3,1/3,1/3))
sim <- splatSimulate(params, method="groups", verbose=FALSE)
counts = data.frame(counts(sim))
meta = data.frame(colData(sim))
import anndata
import scanpy.api as sc
import bbknn
/Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. from ._conv import register_converters as _register_converters
sc.settings.set_figure_params(dpi=80) # low dpi (dots per inch) yields small inline figures
Do some basic data processing (scale to 10,000, log-transform), get a PCA.
adata = anndata.AnnData(X=counts.values.T, obs=meta)
sc.pp.normalize_per_cell(adata,counts_per_cell_after=10000)
sc.pp.log1p(adata)
sc.tl.pca(adata,svd_solver='arpack')
Standard neighbour inference procedure without any batch correction leads to a pretty heavy technical effect corrupting the visualisation.
sc.pp.neighbors(adata, n_neighbors=9)
sc.tl.umap(adata)
sc.pl.umap(adata,color=['Batch','Group'])
... storing 'Batch' as categorical ... storing 'Group' as categorical
Applying BBKNN leads to the cell types being reconnected across the batches.
bbknn.bbknn(adata, batch_key='Batch', metric='euclidean')
sc.tl.umap(adata)
sc.pl.umap(adata,color=['Batch','Group'])
Let's inspect the influence of the trimming parameter. To do this, let's trim down the simulated data to a smaller setup - two batches, one has two cell types, the third has three. This is what we're left with on the original UMAP space.
to_del = (adata.obs['Batch'] == 'Batch3') | ((adata.obs['Batch'] == 'Batch2') & (adata.obs['Group'] == 'Group3'))
adata2 = adata[[not i for i in to_del]]
sc.pl.umap(adata2,color=['Batch','Group'])
Running BBKNN (after re-running PCA, as we have a subset of the original cells here) without trimming the graph leads to the population present in only one of the samples being erroneously integrated with an unrelated population.
sc.tl.pca(adata2,svd_solver='arpack')
bbknn.bbknn(adata2, batch_key='Batch', metric='euclidean')
sc.tl.umap(adata2)
sc.pl.umap(adata2,color=['Batch','Group'])
Trimming the graph leads to the population's independence being correctly restored, while not splitting up any of the correctly merged cell types.
bbknn.bbknn(adata2, batch_key='Batch', trim=10, metric='euclidean')
sc.tl.umap(adata2)
sc.pl.umap(adata2,color=['Batch','Group'])