In this notebook we follow Buettner and Theis (2012) and use the GP-LVM to analyze some single cell data from Guo et al (2010). They performed a qPCR TaqMan array on single cells from the developing blastocyst in mouse. The data is taken from the early stages of development when the blastocyst is forming. At the 32 cell stage the data is already separated into the trophectoderm (TE), which goes on to form the placenta, and the inner cell mass (ICM). The ICM further differentiates into the epiblast (EPI)---which gives rise to the endoderm, mesoderm and ectoderm---and the primitive endoderm (PE), which develops into the amniotic sac. Guo et al selected 48 genes for expression measurement. They labelled the resulting cells and their labels are included as an aid to visualization.
In the paper, they first visualized their data using principal component analysis. The first two principal components fail to separate the developmental stages. This is perhaps because the principal components are dominated by the variation at the 64 cell stage. That in turn may be because there are more cells from that stage in the data set, or because the natural variation across cells at the 64 cell stage is greater. Both are probable causes of the dominance of these cells in the first two principal components. The first thing we do is plot the principal coordinates of the cells along the two dominant directions.
In this notebook we will perform PCA on the original data, showing that the different regimes do not separate. Then, we follow Buettner and Theis (2012) in applying the GP-LVM to the data. There is a slight pathology in the result, which they fixed by using priors that depend on the developmental stage. We then show how the Bayesian GP-LVM doesn't exhibit those pathologies and gives nice results that seem to show the lineage of the cells.
This notebook is an extension of SingleCellDataWithGPy for tutorial needs.
import pods, GPy, itertools
%matplotlib inline
from matplotlib import pyplot as plt
Next we load in the data. We've provided a convenience function for loading the data with GPy. It is loaded as a pandas DataFrame, which allows us to summarize it with the describe method.
s = pods.datasets.singlecell()
Ydf = s['Y']
Y = Ydf.values
labels = s['labels']
marker = '<>^vsd'
Ydf.describe()
Acquiring resource: singlecell

Please cite: Guoji Guo, Mikael Huss, Guo Qing Tong, Chaoyang Wang, Li Li Sun, Neil D. Clarke, Paul Robson, Resolution of Cell Fate Decisions Revealed by Single-Cell Gene Expression Analysis from Zygote to Blastocyst, Developmental Cell, Volume 18, Issue 4, 20 April 2010, Pages 675-685, ISSN 1534-5807, http://dx.doi.org/10.1016/j.devcel.2010.02.012.
 | Actb | Ahcy | Aqp3 | Atp12a | Bmp4 | Cdx2 | Creb312 | Cebpa | Dab2 | DppaI | ... | Sox2 | Sall4 | Sox17 | Snail | Sox13 | Tcfap2a | Tcfap2c | Tcf23 | Utf1 | Tspan8
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | ... | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 | 4.370000e+02 |
mean | 2.089245e-08 | 2.400458e-08 | 3.011442e-08 | 2.009153e-08 | 1.416476e-08 | 2.661327e-08 | 1.828375e-08 | 2.329519e-08 | 2.993135e-08 | 2.077803e-08 | ... | 2.180778e-08 | 2.146453e-08 | 2.077803e-08 | 2.585812e-08 | 2.473684e-08 | 2.670481e-08 | 2.009153e-08 | 2.231121e-08 | 2.263158e-08 | 2.606407e-08 |
std | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | ... | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 | 1.001146e+00 |
min | -2.997659e+00 | -2.140030e+00 | -1.768643e+00 | -2.355953e+00 | -4.420191e+00 | -1.972546e+00 | -2.493296e+00 | -2.290915e+00 | -1.641875e+00 | -1.755517e+00 | ... | -2.246728e+00 | -2.015028e+00 | -2.960914e+00 | -1.964500e+00 | -2.033562e+00 | -1.886886e+00 | -2.238215e+00 | -2.089612e+00 | -1.825355e+00 | -2.035440e+00 |
25% | -7.169942e-01 | -7.796685e-01 | -7.171038e-01 | -9.337937e-01 | -2.516838e-01 | -6.446426e-01 | -7.717112e-01 | -8.545864e-01 | -6.703301e-01 | -1.018770e+00 | ... | -7.734272e-01 | -9.343944e-01 | -6.515803e-01 | -7.255418e-01 | -7.276985e-01 | -8.376303e-01 | -9.886838e-01 | -9.197137e-01 | -9.236930e-01 | -7.673579e-01 |
50% | 9.227372e-02 | -1.782182e-01 | -1.842611e-01 | 2.928931e-01 | 2.146358e-01 | -1.353097e-01 | 2.772327e-01 | -6.806050e-02 | -2.415102e-01 | 1.097815e-01 | ... | 7.709900e-03 | 8.293864e-02 | -3.701860e-02 | -1.972392e-01 | -1.336842e-01 | -2.656403e-01 | 2.828122e-01 | -1.836380e-02 | -1.127644e-01 | -2.462695e-01 |
75% | 6.808323e-01 | 6.245760e-01 | 3.312916e-01 | 8.234601e-01 | 6.738631e-01 | 4.854397e-01 | 8.213037e-01 | 9.125241e-01 | 4.854736e-01 | 9.259312e-01 | ... | 8.608111e-01 | 8.312436e-01 | 6.177259e-01 | 1.082358e+00 | 5.329677e-01 | 9.967453e-01 | 8.099583e-01 | 1.022974e+00 | 8.257577e-01 | 6.422915e-01 |
max | 2.490650e+00 | 3.564713e+00 | 2.627702e+00 | 1.687127e+00 | 1.569268e+00 | 3.427747e+00 | 1.826346e+00 | 1.729923e+00 | 2.780331e+00 | 1.671790e+00 | ... | 1.709987e+00 | 4.489121e+00 | 3.565258e+00 | 1.655467e+00 | 2.231074e+00 | 2.104293e+00 | 4.187598e+00 | 1.808619e+00 | 2.208022e+00 | 2.270316e+00 |
8 rows × 48 columns
Exercise: get a first impression of the data by visualizing the expression matrix with plt.imshow().
# Type your code here
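A minimal sketch of such a visualization. Here random stand-in data (`Y_demo`) replaces the expression matrix; in the notebook you would pass `Y` itself (and you could omit the backend selection, which is only needed for scripted runs):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs; omit in the notebook
import matplotlib.pyplot as plt

# stand-in for the 437 x 48 expression matrix Y
np.random.seed(0)
Y_demo = np.random.randn(437, 48)

fig, ax = plt.subplots(figsize=(4, 8))
# each row is a cell, each column a gene
im = ax.imshow(Y_demo, aspect='auto', cmap='RdBu_r')
ax.set_xlabel('gene')
ax.set_ylabel('cell')
fig.colorbar(im, ax=ax)
```

With the real data, blocks of similarly expressed genes become visible across the developmental stages.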
Here we define a little routine which puts the legend on the right side of the plot, so that the legend does not overwrite the data plot.
def legend_to_the_right(ax):
    # shrink the axes horizontally to make room for the legend
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
    _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), numpoints=1)
def plot_latent(ax, x, y, marker, labels):
    GPy.plotting.matplot_dep.Tango.reset()
    # collect unique labels, preserving the order in which they first appear
    ulabels = []
    for lab in labels:
        if lab not in ulabels:
            ulabels.append(lab)
    # cycle through the marker symbols, one per label
    for label, m in zip(ulabels, itertools.cycle(marker)):
        ind = labels == label
        ax.plot(x[ind], y[ind], marker=m,
                markerfacecolor=GPy.plotting.matplot_dep.Tango.nextMedium(),
                linestyle='',
                label=label, mew=.2, alpha=1)
Now we follow Guo et al in performing PCA on the data. Rather than calling a 'PCA routine', here we break the algorithm down into its steps: compute the data covariance, compute its eigenvalues and eigenvectors, and sort them according to the magnitude of the eigenvalues. Because we want to visualize the data, we've chosen to compute the eigenvectors of the inner product matrix rather than the covariance matrix. This allows us to plot the principal coordinates (the scaled eigenvectors) directly. However, this is less efficient (in this case because the number of genes is smaller than the number of cells) than computing the eigendecomposition of the covariance matrix.
Now, using the helper function, we can plot the results with appropriate labels.

Exercise: use the GPy.util.pca.PCA class to run PCA on the dataset (Y). Plot the first two principal components against each other, either with the plot_latent function defined above or with PCA.plot_2d(). You can show the fraction of variance explained by each component using plt.bar, or with PCA.plot_frac(Q=10).

GPy.util.pca.PCA?
# Type your code here
Buettner and Theis applied a Gaussian process latent variable model to the data, but with a modified prior to ensure that small differences between cells at the same developmental stage were preserved. Here we apply a standard GP-LVM (no modified prior) to the data.
Exercise: fit a GPy.models.GPLVM to the data and visualize the latent space (with m.plot_latent, or with the plot_latent function defined above). Kernels to experiment with include GPy.kern.Linear, GPy.kern.White, GPy.kern.Bias and GPy.kern.RBF. A good combination is GPy.kern.RBF(Q, ARD=1) + GPy.kern.Bias(Q), where Q is the number of latent dimensions.

# Type your code here
# A result could look like this:
fig, [ax1, ax2] = plt.subplots(1,2,figsize=(9,4))
m.kern.plot_ARD(ax=ax1)
m.plot_latent(labels=labels, marker=marker, legend=False, ax=ax2)
_ = legend_to_the_right(ax2)
As a small bonus, you can plot the magnification factor here, using m.plot_magnification(labels=labels, marker=marker, legend=True).

# Type your code here
Finally we show the new code that uses the Bayesian GP-LVM to fit the data. This means we can automatically determine the dimensionality of the model whilst fitting a non-linear dimensionality reduction. The approximations we use also mean that it is faster than the original GP-LVM.
Exercise: fit a GPy.models.BayesianGPLVM to the data. Again, kernels to experiment with include GPy.kern.Linear, GPy.kern.White, GPy.kern.Bias and GPy.kern.RBF.

# Type your code here
This gives a really nice result. Broadly speaking, two latent dimensions dominate the representation. When we visualize using these two dimensions we can see the entire cell phylogeny laid out nicely. We can also see the misclassification of some cells by the 'standard' approach of repeated k-means clustering and PCA on the sub-clusters (this was the approach used to obtain the sample colors of the 64 cell stage).
How do other non-linear approaches compare with the GP-LVM? We can use scikit-learn to try a few. First we consider Isomap, which often performs really well on data.
Isomap first builds a neighborhood graph, and then uses distances along this graph to approximate the geodesic distance between points. These distances are then visualized by performing classical multidimensional scaling (which involves computing the eigendecomposition of the centred distance matrix). As the neighborhood size approaches the number of data points, principal component analysis is recovered (or strictly speaking, principal coordinate analysis). The fewer the neighbors, the more 'non-linear' the Isomap embedding is.
n_neighbors = 30
import sklearn.manifold
m = sklearn.manifold.Isomap(n_neighbors=n_neighbors, n_components=2)
X = m.fit_transform(Ydf)
fig, ax = plt.subplots(1)
plot_latent(ax, X[:, 0], X[:, 1], marker, labels)
_ = legend_to_the_right(ax)
Next we try locally linear embedding. In locally linear embedding a neighborhood is also computed. Each point is then reconstructed by its neighbors using a linear weighting. This implies a locally linear patch is being fitted to the data in that region. These patches are assimilated into a large $n\times n$ matrix and a lower dimensional data set which reflects the same relationships is then sought. Quite a large number of neighbors needs to be selected for the data to not collapse in on itself. When a large number of neighbors is selected the embedding is more linear and begins to look like PCA. However, the algorithm does not converge to PCA in the limit as the number of neighbors approaches $n$.
n_neighbors = 200
m = sklearn.manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2)
X = m.fit_transform(Y)
fig, ax = plt.subplots(1)
plot_latent(ax, X[:, 0], X[:, 1], marker, labels)
_ = legend_to_the_right(ax)