Ref: Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).

In [1]:
%matplotlib inline

In [2]:
import stream as st

In [3]:
st.__version__

Out[3]:
'0.3.8'

In [4]:
adata=st.read(file_name='./data_Nestorowa.tsv.gz')

Using default working directory.
Saving results in: /Users/huidong/Projects/Github/STREAM/tutorial/stream_result

In [ ]:



To load and use 10x Genomics single cell RNA-seq data processed with Cell Ranger:
(Make sure 'genes.tsv' and 'barcodes.tsv' are under the same folder as 'matrix.mtx')

adata=st.read(file_name='./filtered_gene_bc_matrices/hg19/matrix.mtx',file_format='mtx')

In [ ]:



#### read in cell labels and label color¶

if cell label file or cell color file is not provided, please simply run:

'unknown' will be added as the default label for all cells
st.add_cell_labels(adata)

'random color will be generated for each cell label
st.add_cell_colors(adata)

In [5]:
st.add_cell_labels(adata,file_name='./cell_label.tsv.gz')


#### other useful preprocessing steps when dealing with raw-count data¶

Normalize gene expression based on library size
st.normalize_per_cell(adata)

Logarithmize gene expression
st.log_transform(adata)

Remove mitochondrial genes
st.remove_mt_genes(adata)

Filter out cells
st.filter_cells(adata)

Filter out genes

In [6]:
st.filter_genes(adata,min_num_cells = max(5,int(round(adata.shape[0]*0.001))),
min_pct_cells = None,expr_cutoff = 1)

Filter genes based on min_num_cells
After filtering out low-expressed genes:
1656 cells, 4762 genes

In [ ]:



check parameters
st.select_variable_genes?

Please check if the blue curve fits the points well.
If not, try to lower the parameter 'loess_frac' (By default, loess_frac=0.1) until the blue curve fits well.

e.g. st.select_variable_genes(adata,loess_frac=0.01)

In [7]:
st.select_variable_genes(adata)

239 variable genes are selected


alternatively, user can also select top principal components

st.select_top_principal_components?
st.select_top_principal_components(adata)

In [ ]:



check parameters
st.dimension_reduction?

Tips:

by default n_components =3
For biological process with simple bifurcation or linear trajectory, two components would be recommended
e.g, st.dimension_reduction(adata,n_components =2)

Several alternative dimension reduction methods are also supported, se(spectral embedding), umap, pca.
by default, method ='mlle'.

• For large dataset, se(Spectral Embedding) works faster than MLLE while preserving the similar compact structure to MLLE.
e.g. st.dimension_reduction(adata,method ='se')
• For large dataset, lowering the percentage of neighbors (nb_pct=0.1 by default) will speed up this step
e.g, st.dimension_reduction(adata,nb_pct =0.01)
In [8]:
st.dimension_reduction(adata)

feature var_genes is being used ...
8 cpus are being used ...

In [9]:
st.plot_dimension_reduction(adata)

In [ ]:



check parameters
st.plot_visualization_2D?

Tips:

Before the downstream elastic principal graph learning, it is important to visualize the top components in 2D plane with UMAP (by default) or tSNE(st.plot_visualization_2D(adata,method='tsne')) to confirm the existence of meaningful biological trajectory pattern

In [10]:
st.plot_visualization_2D(adata)

In [ ]:



check parameters
st.seed_elastic_principal_graph?

Tips:

To better scale up STREAM to large datasets, since version 0.3.8, the default clustering method has been changed from 'ap' (affinity propagation) to 'kmeans'. Users can specify clustering = 'ap' to reproduce the analyses in STREAM paper:
i.e. st.seed_elastic_principal_graph(adata,clustering='ap')

If cells form a big bulk in MLLE space, 'ap' may generate too many branches.
In that case, try clustering = 'kmeans' or clustering = 'sc' to avoid a too complex initial strcuture

For noisy dataset, spectral clustering is proved to work better to get rid of noisy branches
e.g. st.seed_elastic_principal_graph(adata,clustering='sc',n_clusters=10)

In [11]:
st.seed_elastic_principal_graph(adata)

Seeding initial elastic principal graph...
Clustering...
K-Means clustering ...
The number of initial nodes is 10
Calculatng minimum spanning tree...
Number of initial branches: 3

In [12]:
st.plot_branches(adata)


check parameters
st.elastic_principal_graph?

Tips:

• Increase the parameter 'epg_alpha' will help control spurious branches(by default epg_alpha=0.02)
e.g. st.elastic_principal_graph(adata,epg_alpha=0.03)
• Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
e.g. st.elastic_principal_graph(adata,epg_trimmingradius=0.1)
In [13]:
st.elastic_principal_graph(adata)

Learning elastic principal graph...
[1]
"Constructing tree 1 of 1 / Subset 1 of 1"

[1]
"Computing EPG with 50 nodes on 1656 points and 3 dimensions"

[1]
"Using a single core"

Nodes =
10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

BARCODE	ENERGY	NNODES	NEDGES	NRIBS	NSTARS	NRAYS	NRAYS2	MSE	MSEP	FVE	FVEP	UE	UR	URN	URN2	URSD

2||50

0.0001097

50

49

44

2

0

0

6.533e-05

6.253e-05

0.964

0.9655

3.93e-05

5.071e-06

0.0002536

0.01268

0

10.219 sec elapsed

[[1]]

Number of branches after learning elastic principal graph: 5

In [14]:
st.plot_branches(adata)


Tips:

• Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
e.g. st.optimize_branching(adata,epg_trimmingradius=0.1)
In [15]:
### optional step

Optimizing branching...
[1]
"Constructing tree 1 of 1 / Subset 1 of 1"

[1]
"Computing EPG with 80 nodes on 1656 points and 3 dimensions"

[1]
"Using a single core"

Nodes =
50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

BARCODE	ENERGY	NNODES	NEDGES	NRIBS	NSTARS	NRAYS	NRAYS2	MSE	MSEP	FVE	FVEP	UE	UR	URN	URN2	URSD

2||80

7.157e-05

80

79

74

2

0

0

4.513e-05

4.285e-05

0.9751

0.9764

2.029e-05

6.145e-06

0.0004916

0.03933

0

3.739 sec elapsed

Number of branches after optimizing branching: 5

In [ ]:



Other optional steps:

• Prune branches:
st.prune_elastic_principal_graph?
st.prune_elastic_principal_graph(adata)
• Shift branching node:
st.shift_branching?
st.shift_branching(adata)
In [ ]:



check parameters
st.extend_elastic_principal_graph?

Tips:

• Add 'epg_trimmingradius' will help get rid of noisy points (by defalut epg_trimmingradius=Inf)
e.g. st.extend_elastic_principal_graph(adata,epg_trimmingradius=0.1)
In [16]:
###Extend leaf branch to reach further cells

Extending leaves with additional nodes ...