v0.0.5
Imperial BRC Genomics Facility
Imperial BRC Genomics Facility
imperialgenomicsfacility/scanpy-notebook-image:release-v0.0.4
2021-May-04 12:32
This notebook for running single cell data analysis (for a single sample) using Scanpy package. Most of the codes and documentation used in this notebook has been copied from the following sources:
We need to load all the required libraries to environment before we can run any of the analysis steps. Also, we are checking the version information for most of the major packages used for analysis.
%matplotlib inline
import os
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import matplotlib.pyplot as plt
from copy import deepcopy
import plotly.graph_objs as go
from IPython.display import HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
sc.settings.verbosity = 0
sc.logging.print_header()
sns.set_theme(context='notebook', style='darkgrid', palette='colorblind')
plt.rcParams['figure.dpi'] = 150
We are setting the output file path to $/tmp/scanpy\_output.h5ad$
results_file = '/tmp/scanpy_output.h5ad'
The following steps are only required for downloading test data from 10X Genomics's website.
%%bash
rm -rf cache
rm -rf /tmp/data
mkdir -p /tmp/data
wget -q -O /tmp/data/pbmc3k_filtered_gene_bc_matrices.tar.gz \
/tmp/data http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
cd /tmp/data
tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
Load the Cellranger output to Scanpy
adata = \
sc.read_10x_mtx(
'/tmp/data/filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=True)
Converting the gene names to unique values
adata.var_names_make_unique()
Checking the data dimensions before checking QC
adata
Scanpy stores the count data is an annotated data matrix ($observations$ e.g. cell barcodes × $variables$ e.g gene names) called AnnData together with annotations of observations($obs$), variables ($var$) and unstructured annotations ($uns$)
Computing fraction of counts assigned to each gene over all cells. The top genes with the highest mean fraction over all cells are plotted as boxplots.
fig,ax = plt.subplots(1,1,figsize=(4,5))
sc.pl.highest_expr_genes(adata, n_top=20,ax=ax)
Single-Cell Remover of Doublet predicts cell doublets using a nearest-neighbor classifier of observed transcriptomes and simulated doublets.
sc.external.pp.scrublet(adata)
Scrublet tries to identify dublets using a threshold doublet score (ideally set at the minimum of the simulated doublet histogram).
For more information, check this tutorial notebook
sc.external.pl.scrublet_score_distribution(adata);
Checking $obs$ section of the AnnData object
adata.obs.head()
Checking the $var$ section of the AnnData object
adata.var.head()
Listing the Mitochondrial genes detected in the cell population
mt_genes = 0
mt_genes = [gene for gene in adata.var_names if gene.startswith('MT-')]
mito_genes = adata.var_names.str.startswith('MT-')
if len(mt_genes)==0:
print('Looking for mito genes with "mt-" prefix')
mt_genes = [gene for gene in adata.var_names if gene.startswith('mt-')]
mito_genes = adata.var_names.str.startswith('mt-')
if len(mt_genes)==0:
print("No mitochondrial genes found")
else:
print("Mitochondrial genes: count: {0}, lists: {1}".format(len(mt_genes),mt_genes))
Typical quality measures for assessing the quality of a cell includes the following components
We are calculating the above mentioned details using the following codes
adata.obs['mito_counts'] = np.sum(adata[:, mito_genes].X, axis=1).A1
adata.obs['percent_mito'] = \
np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
adata.obs['log_counts'] = np.log(adata.obs['n_counts'])
adata.obs['n_genes'] = (adata.X > 0).sum(1)
Checking $obs$ section of the AnnData object again
adata.obs.head()
Sorting barcodes based on the $percent\_mito$ column
adata.obs.sort_values('percent_mito',ascending=False).head()
A high fraction of mitochondrial reads being picked up can indicate cell stress, as there is a low proportion of nuclear mRNA in the cell. It should be noted that high mitochondrial RNA fractions can also be biological signals indicating elevated respiration.
Cell barcodes with high count depth, few detected genes and high fraction of mitochondrial counts may indicate cells whose cytoplasmic mRNA has leaked out due to a broken membrane and only the mRNA located in the mitochondria has survived.
Cells with high UMI counts and detected genes may represent doublets (it requires further checking).
adata.obs['predicted_doublet_str'] = adata.obs['predicted_doublet'].map(lambda x: 'dublets' if x else 'NA').astype(str)
plt.rcParams['figure.figsize']=(8,6)
ax = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='predicted_doublet_str',show=False,palette=sns.color_palette('colorblind'))
ax.set_title('Predicted dublets', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Number of genes",fontsize=12)
ax.tick_params(labelsize=12)
plt.rcParams['figure.figsize']=(8,5)
sc.pl.violin(\
adata,
['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4,
log=True,
stripplot=True,
show=False,
multi_panel=False)
Violin plot (above) shows the computed quality measures of UMI counts, gene counts and fraction of mitochondrial counts.
plt.rcParams['figure.figsize']=(8,6)
ax = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='percent_mito',show=False,)
ax.set_title('Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Number of genes",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(700, 0,1, color='red')
ax.axvline(1500, 0,1, color='red')
The above scatter plot shows number of genes vs number of counts with $MT$ fraction information. We will be using a cutoff of 1500 counts and 700 genes (red lines) to filter out dying cells.
plt.rcParams['figure.figsize']=(8,6)
ax = sc.pl.scatter(adata[adata.obs['n_counts']<10000], 'n_counts', 'n_genes', color='percent_mito',show=False)
ax.set_title('Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Number of genes",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(700, 0,1, color='red')
ax.axvline(1500, 0,1, color='red')
A similar scatter plot, but this time we have restricted the counts to below 10K
plt.rcParams['figure.figsize']=(8,6)
count_data = adata.obs['n_counts'].copy()
count_data.sort_values(inplace=True, ascending=False)
order = range(1, len(count_data)+1)
ax = plt.semilogy(order, count_data, 'b-')
plt.gca().axhline(1500, 0,1, color='red')
plt.xlabel("Barcode rank", fontsize=12)
plt.ylabel("Count depth", fontsize=12)
plt.tick_params(labelsize=12)
The above plot is similar to UMI counts vs Barcodes plot of Cellranger report and it shows the count depth distribution from high to low count depths. This plot can be used to decide the threshold of count depth to filter out empty droplets.
plt.rcParams['figure.figsize']=(8,6)
ax = sns.histplot(adata.obs['n_counts'], kde=False)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.axvline(1500, 0,1, color='red')
The above histogram plot shows the distribution of count depth and the red line marks the count threshold 1500.
plt.rcParams['figure.figsize']=(8,6)
if (adata.obs['n_counts'].max() - 10000)> 10000:
print('Checking counts above 10K')
ax = sns.histplot(adata.obs['n_counts'][adata.obs['n_counts']>10000], kde=False, bins=60)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
else:
print("Skip checking counts above 10K")
plt.rcParams['figure.figsize']=(8,6)
if adata.obs['n_counts'].max() > 2000:
print('Zooming into first 2000 counts')
ax = sns.histplot(adata.obs['n_counts'][adata.obs['n_counts']<2000], kde=False, bins=60)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.axvline(1500, 0,1, color='red')
else:
print("Failed to zoom into the counts below 2K")
ax = sns.histplot(adata.obs['n_genes'], kde=False)
ax.set_xlabel("Number of genes",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.tick_params(labelsize=12)
ax.axvline(700, 0,1, color='red')
The above histogram plot shows the distribution of gene counts and the red line marks the gene count threshold 700.
if adata.obs['n_genes'].max() > 1000:
print('Zooming into first 1000 gene counts')
ax = sns.histplot(adata.obs['n_genes'][adata.obs['n_genes']<1000], kde=False,bins=60)
ax.set_xlabel("Number of genes",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.tick_params(labelsize=12)
ax.axvline(700, 0,1, color='red')
else:
print("Failed to zoom into the gene counts below 1K")
We use a permissive filtering threshold of 1500 counts and 700 gene counts to filter out the dying cells or empty droplets with ambient RNA.
adata.var['cells_per_gene'] = np.sum(adata.X > 0, 0).T
ax = sns.histplot(adata.var['cells_per_gene'][adata.var['cells_per_gene'] < 100], kde=False, bins=60)
ax.set_xlabel("Number of cells",fontsize=12)
ax.set_ylabel("Frequency",fontsize=12)
ax.set_title('Cells per gene', fontsize=12)
ax.tick_params(labelsize=12)
ax = sc.pl.scatter(adata, x='n_counts', y='percent_mito',show=False)
ax.set_title('Count depth vs Fraction mitochondrial counts', fontsize=12)
ax.set_xlabel("Count depth",fontsize=12)
ax.set_ylabel("Fraction mitochondrial counts",fontsize=12)
ax.tick_params(labelsize=12)
ax.axhline(0.2, 0,1, color='red')
The scatter plot showing the count depth vs MT fraction counts and the red line shows the default cutoff value for MT fraction 0.2
Now we need to filter the cells which doesn't match our thresholds.
print('Total number of cells: {0}'.format(adata.n_obs))
min_counts_threshold = 1500
max_counts_threshold = 40000
min_gene_counts_threshold = 700
max_mito_pct_threshold = 0.2
sc.pp.filter_cells(adata, min_counts = min_counts_threshold)
print('Number of cells after min count ({0}) filter: {1}'.format(min_counts_threshold,adata.n_obs))
sc.pp.filter_cells(adata, max_counts = max_counts_threshold)
print('Number of cells after max count ({0}) filter: {1}'.format(max_counts_threshold,adata.n_obs))
sc.pp.filter_cells(adata, min_genes = min_gene_counts_threshold)
print('Number of cells after gene ({0}) filter: {1}'.format(min_gene_counts_threshold,adata.n_obs))
adata = adata[adata.obs['percent_mito'] < max_mito_pct_threshold]
print('Number of cells after MT fraction ({0}) filter: {1}'.format(max_mito_pct_threshold,adata.n_obs))
print('Total number of cells after filtering: {0}'.format(adata.n_obs))
Also, we need to filter out any genes that are detected in only less than 20 cells. This operation reduces the dimensions of the matrix by removing zero count genes which are not really informative.
min_cell_per_gene_threshold = 20
print('Total number of genes: {0}'.format(adata.n_vars))
sc.pp.filter_genes(adata, min_cells=min_cell_per_gene_threshold)
print('Number of genes after cell filter: {0}'.format(adata.n_vars))
sc.pp.normalize_total(adata, target_sum=1e4)
We are using a simple total-count based normalization (library-size correct) to transform the data matrix $X$ to 10,000 reads per cell, so that counts become comparable among cells.
sc.pp.log1p(adata)
Then logarithmize the data matrix
adata.raw = adata
Copying the normalized and logarithmized raw gene expression data to the .raw
attribute of the AnnData object for later use.
Following codes blocks are used to identify the highly variable genes (HGV) to further reduce the dimensionality of the dataset and to include only the most informative genes. HGVs will be used for clustering, trajectory inference, and dimensionality reduction/visualization.
plt.rcParams['figure.figsize']=(7,5)
sc.pp.highly_variable_genes(adata, flavor='seurat',n_top_genes=2000)
seurat_hgv = np.sum(adata.var['highly_variable'])
print("Counts of HGVs: {0}".format(seurat_hgv))
sc.pl.highly_variable_genes(adata)
We use a 'seurat' flavor based HGV detection step. Then, we run the following codes to do the actual filtering of data. The plots show how the data was normalized to select highly variable genes irrespective of the mean expression of the genes. This is achieved by using the index of dispersion which divides by mean expression, and subsequently binning the data by mean expression and selecting the most variable genes within each bin.
adata = adata[:, adata.var.highly_variable]
Normalization scales count data to make gene counts comparable between cells. But it still contain unwanted variability. One of the most prominent technical covariates in single-cell data is count depth. Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed can improve the performance of trajectory inference algorithms.
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
Scale each gene to unit variance. Clip values exceeding standard deviation 10.
sc.pp.scale(adata, max_value=10)
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.
sc.tl.pca(adata, svd_solver='arpack')
plt.rcParams['figure.figsize']=(8,5)
sc.pl.pca(adata,color=['CST3'],size=40)
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells.
The first principle component captures variation in count depth between cells and its marginally informative.
sc.pl.pca_variance_ratio(adata, log=True)
Let us compute the neighborhood graph of cells using the PCA representation of the data matrix.
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
Scanpy documentation recommends the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering (using resolution=1
) directly clusters the neighborhood graph of cells, which we have already computed in the previous section.
sc.tl.leiden(adata,resolution=1)
cluster_length = len(adata.obs['leiden'].value_counts().to_dict().keys())
cluster_colors = sns.color_palette('colorblind',cluster_length,as_cmap=True)
sc.pl.pca(adata,color='leiden',palette=cluster_colors,size=40)
The above PCA plot does not show the expected clustering of the data in two dimensions.
Scanpy documentation suggests embedding the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preservers trajectories.
We are computing a 3D UMAP for a 3D plot then overwriting it with a 2 dimension UMAP.
leiden_series = deepcopy(adata.obs['leiden'])
cell_clusters = list(leiden_series.value_counts().to_dict().keys())
colors = cluster_colors
dict_map = dict(zip(cell_clusters,colors))
color_map = leiden_series.map(dict_map).values
labels = list(adata.obs.index)
sc.tl.umap(
adata,
n_components=3)
hovertext = \
['cluster: {0}, barcode: {1}'.\
format(
grp,labels[index])
for index,grp in enumerate(leiden_series.values)]
## plotting 3D UMAP as html file
plot(
[go.Scatter3d(
x=adata.obsm['X_umap'][:, 0],
y=adata.obsm['X_umap'][:, 1],
z=adata.obsm['X_umap'][:, 2],
mode='markers',
marker=dict(color=color_map,
size=5),
opacity=0.6,
text=labels,
hovertext=hovertext,
)],
filename='UMAP-3D-plot.html')
sc.tl.umap(adata,n_components=2)
plt.rcParams['figure.figsize']=(8,5)
sc.pl.umap(adata,color=['CST3'],size=40)
You can replace the ['CST3']
in the previous cell with your preferred list of genes.
e.g. ['LTB','IL32','CD3D']
or may be with a Python onliner code to extract gene names with a specific prefix
e.g.
sc.pl.umap(adata, color=[gene for gene in adata.var_names if gene.startswith('CD')], ncols=2)
Plot the scaled and corrected gene expression by use_raw=False
and color them based on the leiden clusters.
sc.pl.umap(adata, color=['leiden'],use_raw=False,palette=cluster_colors,size=40)
adata.obs['predicted_doublet_str'] = adata.obs['predicted_doublet'].map(lambda x: 'dublets' if x else 'NA').astype(str)
sc.pl.umap(adata, color=['predicted_doublet_str'],use_raw=False,palette=cluster_colors,size=40)
sc.tl.tsne(adata,n_pcs=40)
sc.pl.tsne(adata, color=['leiden'],palette=cluster_colors,size=40)
sc.tl.diffmap(adata)
sc.tl.draw_graph(adata)
sc.pl.diffmap(adata,color='leiden',size=40,palette=cluster_colors)
sc.pl.draw_graph(adata,color='leiden',size=40,palette=cluster_colors)
Diffusion Map
Graph
PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high dimensional data. PHATE uses a novel conceptual framework for learning and visualizing the manifold to preserve both local and global distances. For more details, check here
sc.external.tl.phate(adata)
plt.rcParams['figure.figsize']=(8,5)
sc.external.pl.phate(adata,color='leiden',size=40,palette=cluster_colors)
sc.external.tl.phenograph(adata,clustering_algo='leiden')
adata.obs['pheno_leiden'] = adata.obs['pheno_leiden'].astype(str)
sc.pl.umap(adata,color ="pheno_leiden",palette=cluster_colors,size=40)
plt.rcParams['figure.figsize']=(4,4)
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False,ncols=2)
The result of a Wilcoxon rank-sum (Mann-Whitney-U) test is very similar (Sonison & Robinson (2018)).
plt.rcParams['figure.figsize']=(4,4)
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False,ncols=2)
Show the 5 top ranked genes per cluster 0, 1, …, n in a dataframe
HTML(pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5).to_html())
sc.pl.rank_genes_groups_stacked_violin(
adata, n_genes=5,groupby='leiden',swap_axes=False,figsize=(20,10))
sc.pl.rank_genes_groups_dotplot(
adata, n_genes=5,groupby='leiden', dendrogram=True,figsize=(20,10))
sc.pl.rank_genes_groups_matrixplot(adata, n_genes=5, groupby='leiden', figsize=(20,10))
sc.pl.rank_genes_groups_heatmap(
adata, n_genes=5, show_gene_labels=True, groupby='leiden', figsize=(20,10))
sc.pl.rank_genes_groups_tracksplot(adata, n_genes=5, cmap='bwr',figsize=(20,30))
cell_marker_list = None
plt.rcParams['figure.figsize']=(10,10)
if cell_marker_list is not None and \
os.path.exists(cell_marker_list):
df = pd.read_csv(cell_marker_list,delimiter='\t')
markers_df = df[(df['species'] == 'Hs') | (df['species'] == 'Mm Hs')]
cell_types = list(markers_df['cell type'].unique())
markers_dict = {}
for ctype in cell_types:
df = markers_df[markers_df['cell type'] == ctype]
markers_dict[ctype] = df['official gene symbol'].to_list()
cell_annotation = sc.tl.marker_gene_overlap(adata, markers_dict, key='rank_genes_groups',top_n_markers=20)
cell_annotation_norm = sc.tl.marker_gene_overlap(adata, markers_dict, key='rank_genes_groups', normalize='reference',top_n_markers=20)
plt.rcParams['figure.figsize']=(10,10)
sns.heatmap(cell_annotation_norm.loc[cell_annotation_norm.sum(axis=1) > 0.05,], cbar=False, annot=True)
else:
print('No cell marker list found')
Known sources of technical variation in the data have been investigated and corrected for (e.g. batch, count depth). A known source of biological variation that can explain the data is the cell cycle. Here, a gene list from Macosko et al. is used to score the cell cycle effect in the data and classify cells by cell cycle phase.
%%file /tmp/Macosko_cell_cycle_genes.txt
IG1.S,S,G2.M,M,M.G1
ACD,ABCC5,ANLN,AHI1, AGFG1
ACYP1,ABHD10,AP3D1,AKIRIN2,AGPAT3
ADAMTS1,ANKRD18A,ARHGAP19,ANKRD40,AKAP13
ANKRD10,ASF1B,ARL4A,ANLN,AMD1
APEX2,ATAD2,ARMC1,ANP32B,ANP32E
ARGLU1,BBS2,ASXL1,ANP32E,ANTXR1
ATAD2,BIVM,ATL2,ARHGAP19,BAG3
BARD1,BLM,AURKB,ARL6IP1,BTBD3
BRD7,BMI1,BCLAF1,ASXL1,CBX3
C1orf63,BRCA1,BORA,ATF7IP,CDC42
C7orf41,BRIP1,BRD8,AURKA,CDK7
C14orf142,C5orf42,BUB3,BIRC2,CDKN3
CAPN7,C11orf82,C2orf69,BIRC5,CEP70
CASP2,CALD1,C14orf80,BUB1,CNIH4
CASP8AP2,CALM2,CASP3,CADM1,CTR9
CCNE1,CASP2,CBX5,CCDC88A,CWC15
CCNE2,CCDC14,CCDC107,CCDC90B,DCP1A
CDC6,CCDC84,CCNA2,CCNA2,DCTN6
CDC25A,CCDC150,CCNF,CCNB2,DEXI
CDCA7,CDC7,CDC16,CDC20,DKC1
CDCA7L,CDC45,CDC25C,CDC25B,DNAJB6
CEP57,CDCA5,CDCA2,CDC27,DSP
CHAF1A,CDKN2AIP,CDCA3,CDC42EP1,DYNLL1
CHAF1B,CENPM,CDCA8,CDCA3,EIF4E
CLSPN,CENPQ,CDK1,CENPA,ELP3
CREBZF,CERS6,CDKN1B,CENPE,FAM60A
CTSD,CHML,CDKN2C,CENPF,FAM189B
DIS3,COQ9,CDR2,CEP55,FOPNL
DNAJC3,CPNE8,CENPL,CFLAR,FOXK2
DONSON,CREBZF,CEP350,CIT,FXR1
DSCC1,CRLS1,CFD,CKAP2,G3BP1
DTL,DCAF16,CFLAR,CKAP5,GATA2
E2F1,DEPDC7,CHEK2,CKS1B,GNB1
EIF2A,DHFR,CKAP2,CKS2,GRPEL1
ESD,DNA2,CKAP2L,CNOT10,GSPT1
FAM105B,DNAJB4,CYTH2,CNTROB,GTF3C4
FAM122A,DONSON,DCAF7,CTCF,HIF1A
FLAD1,DSCC1,DHX8,CTNNA1,HMG20B
GINS2,DYNC1LI2,DNAJB1,CTNND1,HMGCR
GINS3,E2F8,ENTPD5,DEPDC1,HSD17B11
GMNN,EIF4EBP2,ESPL1,DEPDC1B,HSPA8
HELLS,ENOSF1,FADD,DIAPH3,ILF2
HOXB4,ESCO2,FAM83D,DLGAP5,JMJD1C
HRAS,EXO1,FAN1,DNAJA1,KDM5B
HSF2,EZH2,FANCD2,DNAJB1,KIAA0586
INSR,FAM178A,G2E3,DR1,KIF5B
INTS8,FANCA,GABPB1,DZIP3,KPNB1
IVNS1ABP,FANCI,GAS1,E2F5,KRAS
KIAA1147,FEN1,GAS2L3,ECT2,LARP1
KIAA1586,GCLM,H2AFX,FAM64A,LARP7
LNPEP,GOLGA8A,HAUS8,FOXM1,LRIF1
LUC7L3,GOLGA8B,HINT3,FYN,LYAR
MCM2,H1F0,HIPK2,G2E3,MORF4L2
MCM4,HELLS,HJURP,GADD45A,MRPL19
MCM5,HIST1H2AC,HMGB2,GAS2L3,MRPS2
MCM6,HIST1H4C,HN1,GOT1,MRPS18B
MDM1,INTS7,HP1BP3,GRK6,MSL1
MED31,KAT2A,HRSP12,GTSE1,MTPN
MRI1,KAT2B,IFNAR1,HCFC1,NCOA3
MSH2,KDELC1,IQGAP3,HMG20B,NFIA
NASP,KIAA1598,KATNA1,HMGB3,NFIC
NEAT1,LMO4,KCTD9,HMMR,NUCKS1
NKTR,LYRM7,KDM4A,HN1,NUFIP2
NPAT,MAN1A2,KIAA1524,HP1BP3,NUP37
NUP43,MAP3K2,KIF5B,HPS4,ODF2
ORC1,MASTL,KIF11,HS2ST1,OPN3
OSBPL6,MBD4,KIF20B,HSPA8,PAK1IP1
PANK2,MCM8,KIF22,HSPA13,PBK
PCDH7,MLF1IP,KIF23,INADL,PCF11
PCNA,MYCBP2,KIFC1,KIF2C,PLIN3
PLCXD1,NAB1,KLF6,KIF5B,PPP2CA
PMS1,NEAT1,KPNA2,KIF14,PPP2R2A
PNN,NFE2L2,LBR,KIF20B,PPP6R3
POLD3,NRD1,LIX1L,KLF9,PRC1
RAB23,NSUN3,LMNB1,LBR,PSEN1
RECQL4,NT5DC1,MAD2L1,LMNA,PTMS
RMI2,NUP160,MALAT1,MCM4,PTTG1
RNF113A,OGT,MELK,MDC1,RAD21
RNPC3,ORC3,MGAT2,MIS18BP1,RAN
SEC62,OSGIN2,MID1,MKI67,RHEB
SKP2,PHIP,MIS18BP1,MLLT4,RPL13A
SLBP,PHTF1,MND1,MZT1,SLC39A10
SLC25A36,PHTF2,NCAPD3,NCAPD2,SNUPN
SNHG10,PKMYT1,NCAPH,NCOA5,SRSF3
SRSF7,POLA1,NCOA5,NEK2,STAG1
SSR3,PRIM1,NDC80,NUF2,SYNCRIP
TAF15,PTAR1,NEIL3,NUP35,TAF9
TIPIN,RAD18,NFIC,NUP98,TCERG1
TOPBP1,RAD51,NIPBL,NUSAP1,TLE3
TRA2A,RAD51AP1,NMB,ODF2,TMEM138
TTC14,RBBP8,NR3C1,ORAOV1,TOB2
UBR7,REEP1,NUCKS1,PBK,TOP1
UHRF1,RFC2,NUMA1,PCF11,TROAP
UNG,RHOBTB3,NUSAP1,PLK1,TSC22D1
USP53,RMI1,PIF1,POC1A,TULP4
VPS72,RPA2,PKNOX1,POM121,UBE2D3
WDR76,RRM1,POLQ,PPP1R10,VANGL1
ZMYND19,RRM2,PPP1R2,PRPSAP1,VCL
ZNF367,RSRC2,PSMD11,PRR11,WIPF2
ZRANB2,SAP30BP,PSRC1,PSMG3,WWC1
,SLC38A2,RANGAP1,PTP4A1,YY1
,SP1,RCCD1,PTPN9,ZBTB7A
,SRSF5,RDH11,PWP1,ZCCHC10
,SVIP,RNF141,QRICH1,ZNF24
,TOP2A,SAP30,RAD51C,ZNF281
,TTC31,SKA3,RANGAP1,ZNF593
,TTLL7,SMC4,RBM8A,
,TYMS,STAT1,RCAN1,
,UBE2T,STIL,RERE,
,UBL3,STK17B,RNF126,
,USP1,SUCLG2,RNF141,
,ZBED5,TFAP2A,RNPS1,
,ZWINT,TIMP1,RRP1,
,,TMEM99,SEPHS1,
,,TMPO,SETD8,
,,TNPO2,SFPQ,
,,TOP2A,SGOL2,
,,TRAIP,SHCBP1,
,,TRIM59,SMARCB1,
,,TRMT2A,SMARCD1,
,,TTF2,SPAG5,
,,TUBA1A,SPTBN1,
,,TUBB,SRF,
,,TUBB2A,SRSF3,
,,TUBB4B,SS18,
,,TUBD1,SUV420H1,
,,UACA,TACC3,
,,UBE2C,THRAP3,
,,VPS25,TLE3,
,,VTA1,TMEM138,
,,WSB1,TNPO1,
,,ZNF587,TOMM34,
,,ZNHIT2,TPX2,
,,,TRIP13,
,,,TSG101,
,,,TSN,
,,,TTK,
,,,TUBB4B,
,,,TXNDC9,
,,,TXNRD1,
,,,UBE2D3,
,,,USP13,
,,,USP16,
,,,VANGL1,
,,,WIBG,
,,,WSB1,
,,,YWHAH,
,,,ZC3HC1,
,,,ZFX,
,,,ZMYM1,
,,,ZNF207,
plt.rcParams['figure.figsize']=(8,5)
cc_genes = pd.read_csv('/tmp/Macosko_cell_cycle_genes.txt', sep=',')
s_genes = cc_genes['S'].dropna()
g2m_genes = cc_genes['G2.M'].dropna()
s_genes = [gene.upper() for gene in s_genes]
g2m_genes = [gene.upper() for gene in g2m_genes]
s_genes_ens = adata.var_names[np.in1d(adata.var_names, s_genes)]
g2m_genes_ens = adata.var_names[np.in1d(adata.var_names, g2m_genes)]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes_ens, g2m_genes=g2m_genes_ens)
sc.pl.umap(adata, color='S_score', use_raw=False, size=40 , palette=cluster_colors)
sc.pl.umap(adata, color='G2M_score', use_raw=False, size=40, palette=cluster_colors)
sc.pl.umap(adata, color='phase', use_raw=False, size=40, palette=cluster_colors)
Clustering of single cell data cannot sufficiently explain the cellular heterogeneity as it assumes that the data is composed of distinct groups (e.g. discrete cell types or states). The biological processes that drive the development of the cells are continuous processes. Trajectory inference methods interpret single cell data as a snapshot of the continuous process and findss paths through cellular space that minimize transcriptional changes between neighbouring cells.
Partition-based graph abstraction (PAGA) is a method to reconcile clustering and pseudotemporal ordering. It can be applied to an entire dataset and does not assume that there are continuous trajectories connecting all clusters.
sc.tl.paga(adata, groups='leiden')
adata.uns.get('paga')
plt.rcParams['figure.figsize']=(8,5)
sc.pl.paga(adata, color='leiden',cmap=cluster_colors)
plt.rcParams['figure.figsize']=(8,5)
sc.pl.umap(adata, color='leiden', palette=cluster_colors,size=40)
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color='leiden',palette=cluster_colors,size=40)
sc.pl.paga_compare(adata, basis='umap')
Plotting same visualization on the UMAP layout
plt.rcParams['figure.figsize']=(8,5)
fig1, ax1 = plt.subplots()
sc.pl.umap(adata, size=40, ax=ax1,show=False, palette=cluster_colors)
sc.pl.paga(adata, pos=adata.uns['paga']['pos'],show=False, node_size_scale=1, node_size_power=1, ax=ax1, text_kwds={'alpha':0.75})
The UCSC Cell Browser is an interactive browser for single cell data. You can visualize the PCA, t-SNA and UMAP plot of these cells using it. For more details, please check Cellbrowser docs.
adata.obsm
sc.external.exporting.cellbrowser(
adata,
data_name='PBMC-3K',
annot_keys=['leiden', 'percent_mito', 'n_genes', 'n_counts','predicted_doublet','doublet_score','pheno_leiden'],
data_dir='/tmp/ucsc-data',
cluster_field='leiden')
If you are using Binder for running this tutorial, please run next two cells (after removing the #
from the #!cbBuild -i...
) to access the UCSC Cell Browser.
import os
if 'BINDER_LAUNCH_HOST' in os.environ:
url = '<a href="{0}user/{1}/proxy/8080/">Cellbrowser UI</a>'.\
format(
os.environ.get('JUPYTERHUB_BASE_URL'),
os.environ.get('JUPYTERHUB_CLIENT_ID').replace('jupyterhub-user-','')
)
else:
url = '<a href="http://localhost:8080/">Cellbrowser UI</a>'
HTML(url)
#!cbBuild -i /tmp/ucsc-data/cellbrowser.conf -o /tmp/ucsc-tmp -p 8080 2> /dev/null
When you are done, feel free to stop the above cell by clicking on the stop button from the tab menu.