Building the methylation count matrices

This notebook shows how to build methylation count matrices for a small number of cells.

In [2]:
import episcanpy.api as epi
import pandas as pd
import anndata as ad

We first define two sets of features for the count matrices.

This requires both loading a set of annotations (promoters) and generating genomic windows (100 kb long).

In [3]:
promoters = epi.ct.load_features('../methylation_play_data/mouse_epd_promoters.bed') # load the features from the annotation file
p_names = epi.ct.name_features(promoters) # extract the names of the features (produces unique names)
0.10488486289978027 seconds
In [4]:
windows = epi.ct.make_windows(100000) # generate features
w_names = epi.ct.name_features(windows) # extract name of the features
0.44245481491088867 seconds
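
To make the idea of fixed-size windows concrete, here is a minimal sketch in plain Python. It is not the code behind epi.ct.make_windows: the chromosome lengths are toy values, and the helper names make_fixed_windows and name_windows are hypothetical.

```python
# Toy chromosome lengths in base pairs (hypothetical values, not a real genome).
TOY_CHROM_SIZES = {"chr1": 250_000, "chr2": 120_000}


def make_fixed_windows(chrom_sizes, size):
    """Return {chromosome: [[start, end], ...]} with non-overlapping windows."""
    windows = {}
    for chrom, length in chrom_sizes.items():
        # The last window of a chromosome is truncated at the chromosome end.
        windows[chrom] = [
            [start, min(start + size, length)]
            for start in range(0, length, size)
        ]
    return windows


def name_windows(windows):
    """Build one unique 'chrom_start_end' label per window."""
    return [
        f"{chrom}_{start}_{end}"
        for chrom, intervals in windows.items()
        for start, end in intervals
    ]


toy_windows = make_fixed_windows(TOY_CHROM_SIZES, 100_000)
toy_names = name_windows(toy_windows)
print(toy_names)
```

Because every label combines the chromosome with the window coordinates, the names are unique by construction, which is what makes them usable as column labels of a count matrix.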

Now that we have obtained the candidate feature spaces, we need to specify the methylation summaries of all the cells required to build the count matrices.

The standard input corresponds to methylpy methylation summaries.
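
As an illustration of what such a summary may look like, the sketch below parses three made-up records. It assumes the tab-separated, header-less layout commonly used for methylpy allc files (chromosome, position, strand, sequence context, methylated reads, total reads, methylation call); check your own files before relying on this column order. The helper read_summary is hypothetical and is not part of episcanpy.

```python
import csv
import io

# Three made-up records in the assumed column order:
# chromosome, position, strand, context, methylated reads, total reads, call
TOY_SUMMARY = (
    "1\t3000827\t+\tCGT\t2\t2\t1\n"
    "1\t3001007\t-\tCAT\t0\t3\t0\n"
    "1\t3001018\t+\tCGG\t1\t4\t1\n"
)


def read_summary(handle):
    """Parse a tab-separated summary into a list of per-cytosine records."""
    records = []
    for chrom, pos, strand, context, mc, cov, _call in csv.reader(handle, delimiter="\t"):
        records.append({
            "chrom": chrom,
            "pos": int(pos),
            "strand": strand,
            "context": context,
            "mc": int(mc),    # reads supporting methylation
            "cov": int(cov),  # total reads covering the cytosine
        })
    return records


records = read_summary(io.StringIO(TOY_SUMMARY))
print(records[0])
```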

In [5]:
path_cells = '../methylation_play_data/'
cells = ['cell1.tsv', 'cell2.tsv', 'cell3.tsv', 'cell4.tsv', 'cell5.tsv']
In [6]:
epi.ct.build_count_mtx(cells,
                       annotation=[windows, promoters], # all annotations you want to build your matrix for
                       path=path_cells,
                       output_file=['test_windows_CG.txt', 'test_promoters_CG.txt'], # output file names
                       meth_context='CG', # cytosine context to consider
                       feature_names= [w_names, p_names], # name of the features if you want to write them down
                       threshold=[1, 5]) # minimum number of cytosines/reads required at a given feature;
                                         # below it, the feature's methylation level is treated as missing
0 cell1.tsv
1 cell2.tsv
2 cell3.tsv
3 cell4.tsv
4 cell5.tsv
60.72653293609619 seconds
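
The role of the threshold can be sketched as follows. This is an illustration under stated assumptions, not episcanpy's implementation: it assumes the methylation level of a feature is the ratio of methylated reads to total reads over the cytosines inside the feature, and that the threshold counts covered cytosines. The function feature_methylation and the toy sites are hypothetical.

```python
def feature_methylation(sites, start, end, threshold):
    """Methylation level of one feature, or None when coverage is too sparse.

    sites: iterable of (position, methylated_reads, total_reads) tuples.
    """
    # Keep only covered cytosines that fall inside the feature.
    covered = [
        (mc, cov)
        for pos, mc, cov in sites
        if start <= pos <= end and cov > 0
    ]
    if len(covered) < threshold:
        return None  # treated as a missing value
    total_mc = sum(mc for mc, _ in covered)
    total_cov = sum(cov for _, cov in covered)
    return total_mc / total_cov


# Toy cytosines: three inside the feature [100, 200], one outside.
toy_sites = [(105, 2, 2), (140, 0, 3), (180, 1, 5), (950, 4, 4)]

level_loose = feature_methylation(toy_sites, 100, 200, threshold=1)
level_strict = feature_methylation(toy_sites, 100, 200, threshold=5)
print(level_loose, level_strict)
```

With a threshold of 1 the feature gets a value, while a threshold of 5 marks it as missing because only three cytosines are covered. A stricter threshold therefore trades matrix completeness for more reliable per-feature estimates.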

To build a count matrix based on cytosines that are not in a CG context, simply change the meth_context argument to 'CH'.

In [7]:
epi.ct.build_count_mtx(cells,
                       annotation=[windows, promoters],
                       path=path_cells,
                       output_file=['test_windows_CH.txt', 'test_promoters_CH.txt'],
                       meth_context='CH',
                       feature_names= [w_names, p_names],
                       threshold=[1, 5])
0 cell1.tsv
1 cell2.tsv
2 cell3.tsv
3 cell4.tsv
4 cell5.tsv
228.89920330047607 seconds
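
For readers unfamiliar with the two contexts: CG means the cytosine is followed by a guanine, while CH means it is followed by any other base (H stands for A, C or T). The sketch below shows one way to classify a sequence context; the function classify_context is hypothetical and is not taken from episcanpy.

```python
def classify_context(trinucleotide):
    """Return 'CG', 'CH', or None for an ambiguous or invalid context."""
    context = trinucleotide.upper()
    if len(context) < 2 or context[0] != "C":
        return None  # not a cytosine context
    if context[1] == "G":
        return "CG"
    if context[1] in "ACT":
        return "CH"
    return None  # e.g. an unresolved base such as N


for example in ["CGA", "CAT", "cta", "CNN"]:
    print(example, classify_context(example))
```

CH cytosines are far more numerous than CG cytosines in a genome, which is consistent with the longer run time reported for the CH matrices compared with the CG ones.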

If you don't want to write the count matrices to disk but prefer to keep them in memory as loaded matrices:

In [8]:
w_mtx, p_mtx = epi.ct.build_count_mtx(cells,
                                      annotation=[windows, promoters],
                                      path=path_cells,
                                      output_file=None,
                                      meth_context='CG',
                                      threshold=[1, 5])
0 cell1.tsv
1 cell2.tsv
2 cell3.tsv
3 cell4.tsv
4 cell5.tsv
60.9817430973053 seconds
In [9]:
p_mtx.shape
Out[9]:
(5, 18902)

To load the resulting count matrix as an AnnData object:

In [10]:
adata_p = ad.AnnData(p_mtx, obs=pd.DataFrame(index=cells), var=pd.DataFrame(index=p_names))
In [11]:
adata_p
Out[11]:
AnnData object with n_obs × n_vars = 5 × 18902 

Finally, load the metadata you have on your cells and save the matrix as an AnnData object.

In [12]:
epi.pp.load_metadata(adata_p, '../methylation_play_data/mouse_annot_5cells_Luo17.csv')
0.025742053985595703 seconds
In [13]:
adata_p.write('promoter_matrix_test_CG.h5ad')
... storing 'Animal age' as categorical
... storing 'FACS date' as categorical
... storing 'Brain area' as categorical
... storing 'Laminar layer' as categorical
... storing 'Labeling' as categorical
... storing 'FACS channel' as categorical
... storing 'FACS count' as categorical
... storing 'Bisulfite conversion method' as categorical
... storing 'Library type' as categorical
... storing 'Library pool' as categorical
... storing 'Index i5' as categorical
... storing 'Index i7' as categorical
... storing 'i5 sequence' as categorical
... storing 'i7 sequence' as categorical
... storing 'random primer index' as categorical
... storing 'random primer index sequence' as categorical
... storing 'Sequencing run mode' as categorical
... storing 'Neuron type' as categorical

For further operations on the matrices, see the second part of the methylation tutorials on quality control, processing, clustering and cell type identification.