Datasets can be downloaded from https://www.dropbox.com/sh/dfqht1ob89ku99d/AACI5ZW3aRuq9MhBfSNS_1O_a?dl=0
Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

In [1]:
%matplotlib inline
In [2]:
import stream as st
In [3]:
st.__version__
Out[3]:
'0.3.8'

Read in data

In [4]:
adata=st.read(file_name='./data_Nestorowa.tsv.gz')
Using default working directory.
Saving results in: /Users/huidong/Projects/Github/STREAM/tutorial/stream_result
In [ ]:
 

To load and use 10x Genomics single cell RNA-seq data processed with Cell Ranger:
(Make sure 'genes.tsv' and 'barcodes.tsv' are under the same folder as 'matrix.mtx')

adata=st.read(file_name='./filtered_gene_bc_matrices/hg19/matrix.mtx',file_format='mtx')
In [ ]:
 

read in cell labels and label color

if cell label file or cell color file is not provided, please simply run:

'unknown' will be added as the default label for all cells
st.add_cell_labels(adata)

'random color will be generated for each cell label
st.add_cell_colors(adata)

In [5]:
st.add_cell_labels(adata,file_name='./cell_label.tsv.gz')
st.add_cell_colors(adata,file_name='./cell_label_color.tsv.gz')

other useful preprocessing steps when dealing with raw-count data

Normalize gene expression based on library size
st.normalize_per_cell(adata)

Logarithmize gene expression
st.log_transform(adata)

Remove mitochondrial genes
st.remove_mt_genes(adata)

Filter out cells
st.filter_cells(adata)

Filter out genes

In [6]:
st.filter_genes(adata,min_num_cells = max(5,int(round(adata.shape[0]*0.001))),
                min_pct_cells = None,expr_cutoff = 1)
Filter genes based on min_num_cells
After filtering out low-expressed genes: 
1656 cells, 4762 genes
In [ ]:
 

check parameters
st.select_variable_genes?

Please check if the blue curve fits the points well.
If not, try to lower the parameter 'loess_frac' (By default, loess_frac=0.1) until the blue curve fits well.

e.g. st.select_variable_genes(adata,loess_frac=0.01)

In [7]:
st.select_variable_genes(adata)
239 variable genes are selected

alternatively, user can also select top principal components

st.select_top_principal_components?
st.select_top_principal_components(adata)

In [ ]:
 

check parameters
st.dimension_reduction?

Tips:

by default n_components =3
For biological process with simple bifurcation or linear trajectory, two components would be recommended
e.g, st.dimension_reduction(adata,n_components =2)

Several alternative dimension reduction methods are also supported, se(spectral embedding), umap, pca.
by default, method ='mlle'.

  • For large dataset, se(Spectral Embedding) works faster than MLLE while preserving the similar compact structure to MLLE.
    e.g. st.dimension_reduction(adata,method ='se')
  • For large dataset, lowering the percentage of neighbors (nb_pct=0.1 by default) will speed up this step
    e.g, st.dimension_reduction(adata,nb_pct =0.01)
In [8]:
st.dimension_reduction(adata)
feature var_genes is being used ...
8 cpus are being used ...
In [9]:
st.plot_dimension_reduction(adata)
In [ ]:
 

check parameters
st.plot_visualization_2D?

Tips:

Before the downstream elastic principal graph learning, it is important to visualize the top components in 2D plane with UMAP (by default) or tSNE(st.plot_visualization_2D(adata,method='tsne')) to confirm the existence of meaningful biological trajectory pattern

In [10]:
st.plot_visualization_2D(adata)
In [ ]:
 

check parameters
st.seed_elastic_principal_graph?

Tips:

To better scale up STREAM to large datasets, since version 0.3.8, the default clustering method has been changed from 'ap' (affinity propagation) to 'kmeans'. Users can specify clustering = 'ap' to reproduce the analyses in STREAM paper:
i.e. st.seed_elastic_principal_graph(adata,clustering='ap')

If cells form a big bulk in MLLE space, 'ap' may generate too many branches.
In that case, try clustering = 'kmeans' or clustering = 'sc' to avoid a too complex initial strcuture

For noisy dataset, spectral clustering is proved to work better to get rid of noisy branches
e.g. st.seed_elastic_principal_graph(adata,clustering='sc',n_clusters=10)

In [11]:
st.seed_elastic_principal_graph(adata)
Seeding initial elastic principal graph...
Clustering...
K-Means clustering ...
The number of initial nodes is 10
Calculatng minimum spanning tree...
Number of initial branches: 3
In [12]:
st.plot_branches(adata)
st.plot_branches_with_cells(adata)

check parameters
st.elastic_principal_graph?

Tips:

  • Increase the parameter 'epg_alpha' will help control spurious branches(by default epg_alpha=0.02)
    e.g. st.elastic_principal_graph(adata,epg_alpha=0.03)
  • Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
    e.g. st.elastic_principal_graph(adata,epg_trimmingradius=0.1)
In [13]:
st.elastic_principal_graph(adata)
Learning elastic principal graph...
[1]
 "Constructing tree 1 of 1 / Subset 1 of 1"


[1]
 "Computing EPG with 50 nodes on 1656 points and 3 dimensions"


[1]
 "Using a single core"


Nodes = 
10
 
11
 
12
 
13
 
14
 
15
 
16
 
17
 
18
 
19
 
20
 
21
 
22
 
23
 
24
 
25
 
26
 
27
 
28
 
29
 
30
 
31
 
32
 
33
 
34
 
35
 
36
 
37
 
38
 
39
 
40
 
41
 
42
 
43
 
44
 
45
 
46
 
47
 
48
 
49
 


BARCODE	ENERGY	NNODES	NEDGES	NRIBS	NSTARS	NRAYS	NRAYS2	MSE	MSEP	FVE	FVEP	UE	UR	URN	URN2	URSD

2||50
	
0.0001097
	
50
	
49
	
44
	
2
	
0
	
0
	
6.533e-05
	
6.253e-05
	
0.964
	
0.9655
	
3.93e-05
	
5.071e-06
	
0.0002536
	
0.01268
	
0


10.219 sec elapsed

[[1]]



Number of branches after learning elastic principal graph: 5
In [14]:
st.plot_branches(adata)
st.plot_branches_with_cells(adata)

Tips:

  • Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
    e.g. st.optimize_branching(adata,epg_trimmingradius=0.1)
In [15]:
### optional step
st.optimize_branching(adata)
st.plot_branches(adata)
st.plot_branches_with_cells(adata)
Optimizing branching...
[1]
 "Constructing tree 1 of 1 / Subset 1 of 1"


[1]
 "Computing EPG with 80 nodes on 1656 points and 3 dimensions"


[1]
 "Using a single core"


Nodes = 
50
 
51
 
52
 
53
 
54
 
55
 
56
 
57
 
58
 
59
 
60
 
61
 
62
 
63
 
64
 
65
 
66
 
67
 
68
 
69
 
70
 
71
 
72
 
73
 
74
 
75
 
76
 
77
 
78
 
79
 


BARCODE	ENERGY	NNODES	NEDGES	NRIBS	NSTARS	NRAYS	NRAYS2	MSE	MSEP	FVE	FVEP	UE	UR	URN	URN2	URSD

2||80
	
7.157e-05
	
80
	
79
	
74
	
2
	
0
	
0
	
4.513e-05
	
4.285e-05
	
0.9751
	
0.9764
	
2.029e-05
	
6.145e-06
	
0.0004916
	
0.03933
	
0


3.739 sec elapsed

Number of branches after optimizing branching: 5
In [ ]:
 

Other optional steps:

  • Prune branches:
    st.prune_elastic_principal_graph?
    st.prune_elastic_principal_graph(adata)
  • Shift branching node:
    st.shift_branching?
    st.shift_branching(adata)
In [ ]:
 

check parameters
st.extend_elastic_principal_graph?

Tips:

  • Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
    e.g. st.extend_elastic_principal_graph(adata,epg_trimmingradius=0.1)
In [16]:
###Extend leaf branch to reach further cells 
st.extend_elastic_principal_graph(adata)
st.plot_branches(adata)
st.plot_branches_with_cells(adata)
Extending leaves with additional nodes ...
Number of branches after extending leaves: 5