This notebook details the usage of MAGIC for single cell RNA-seq data.
MAGIC can read single cell RNA-seq data from a csv or sparse mtx file. The csv file contains cells in the rows and genes in the columns. First step is to import the package. The following code snippet imports the magic
package along with other plotting related imports
import magic
# Plotting and miscellaneous imports
import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
The data can be loaded using the magic.mg.SCData.from_csv
or magic.mg.SCData.from_mtx
functions.
# Load single-cell RNA-seq data
scdata = magic.mg.SCData.from_csv(os.path.expanduser('~/Documents/Sophomore/Lab/wishbone/data/sdata_nn_TGFb_day_8_10.csv'),
data_type='sc-seq', normalize=False)
This will create an object of the type magic.mg.SCData
which is the base class for the analysis.
scdata
SCData: 7523 cells x 28909 genes cluster_assignments=None data_type=True diffusion_eigenvalues=None diffusion_eigenvectors=None diffusion_map_correlations=None library_sizes=None magic=None metadata=True normalized=True pca=None tsne=None
This shows that the data matrix contains 7523
cells and 28909
genes along with the different properties of the magic.mg.SCData
class.
The scdata
object can also be initialized using a pandas DataFrame
. An example is shown below
scdata = magic.mg.SCData(df, 'sc-seq')
The first step in data processing for MAGIC is to determine the molecule per cell and molecule per gene cutoffs with which to filter the data
fig, ax = scdata.plot_molecules_per_cell_and_gene()
3.23095955575 4.40377233181
From these histograms, choose the appropriate cutoffs to filter the data. In this case, the data has already been filtered.
# Minimum molecules/cell value
CELL_MIN = 0
# Maximum molecules/cell values
CELL_MAX = 1000000
# Minimum number of nonzero cells/gene
# (None if no filtering desired)
GENE_NONZERO = None
# Minimum number of molecules/gene
# (None if no filtering desired)
GENE_MOLECULES = None
scdata.filter_scseq_data(filter_cell_min=CELL_MIN, filter_cell_max=CELL_MAX,
filter_gene_nonzero=GENE_NONZERO, filter_gene_mols=GENE_MOLECULES)
Next, we will normalize the data by dividing each cell by its molecule count and multiplying the counts of cells by the median of the molecule counts.
scdata = scdata.normalize_scseq_data()
The SCData
object can be saved to a pickle file and loaded using the save
and load
functions.
scdata.save('scdata.p')
scdata = magic.mg.SCdata.load('scdata.p')
Before running MAGIC, a PCA plot showing the percent of variance explained by the top PCA components can be used to choose the number of PCA components to include when running MAGIC.
fig, ax = scdata.plot_pca_variance_explained(n_components=40, random=True)
When running MAGIC, a number of parameters can be specified.
# MAGIC
scdata.run_magic(n_pca_components=20, random_pca=True, t=6, k=30,
ka=10, epsilon=1, rescale_percent=99)
doing PCA Computing distances Autotuning distances Computing kernel MAGIC: L_t = L^t MAGIC: data_new = L_t * data
fig, ax = scdata.scatter_gene_expression(['VIM', 'CDH1'], color='ZEB1')
ax.set_xlabel('Vimentin (VIM)')
ax.set_ylabel('E-cadherin (CDH1)')
<matplotlib.text.Text at 0x1c878af60>
fig, ax = scdata.magic.scatter_gene_expression(['MAGIC VIM', 'MAGIC CDH1'], color ='MAGIC ZEB1')
ax.set_xlabel('MAGIC Vimentin (VIM)')
ax.set_ylabel('MAGIC E-cadherin (CDH1)')
<matplotlib.text.Text at 0x1c87b5a90>
fig, ax = scdata.scatter_gene_expression(['VIM', 'CDH1', 'FN1'], color='ZEB1')
ax.set_xlabel('Vimentin (VIM)')
ax.set_ylabel('E-cadherin (CDH1)')
ax.set_zlabel('Fibronectin (FN1)')
<matplotlib.text.Text at 0x10f4f8898>
fig, ax = scdata.magic.scatter_gene_expression(['MAGIC VIM', 'MAGIC CDH1', 'MAGIC FN1'], color='MAGIC ZEB1')
ax.set_xlabel('MAGIC Vimentin (VIM)')
ax.set_ylabel('MAGIC E-cadherin (CDH1)')
ax.set_zlabel('MAGIC Fibronectin (FN1)')
ax.set_zlim(35, 150)
(35, 150)
scdata.run_pca()
scdata.magic.run_pca()
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['CDH1', 'VIM', 'EZH2', 'ZEB1']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scdata.scatter_gene_expression(genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
gs = gridspec.GridSpec(2,2)
fig = plt.figure(figsize=[15, 12])
genes = ['MAGIC CDH1', 'MAGIC VIM', 'MAGIC EZH2', 'MAGIC ZEB1']
for i in range(len(genes)):
ax = fig.add_subplot(gs[i//2, i%2])
scdata.magic.scatter_gene_expression(genes=['PC2', 'PC3'], color=genes[i], fig=fig, ax=ax)
scdata.run_tsne()
fig, ax = scdata.plot_gene_expression(genes=['CDH1', 'VIM', 'FN1', 'ZEB1'])
colors = {'MAGIC CDH1': scdata.magic.data['MAGIC CDH1'],
'MAGIC VIM': scdata.magic.data['MAGIC VIM'],
'MAGIC FN1': scdata.magic.data['MAGIC FN1'],
'MAGIC ZEB1': scdata.magic.data['MAGIC ZEB1']}
fix, ax = scdata.plot_gene_expression(genes=colors)