#!/usr/bin/env python
# coding: utf-8

# # Blastocyst Development in Mice: Single Cell TaqMan Arrays

# #### presented at the EBI BioPreDyn Course 'The Systems Biology Modelling Cycle'

# #### Max Zwiessele, Oliver Stegle, Neil Lawrence, 12th May 2014, University of Sheffield and EBI.

# In this notebook we follow Buettner and Theis (2012) and use the GP-LVM to analyze some single cell data from Guo et al (2010). They performed a qPCR TaqMan array analysis on single cells from the developing blastocyst in mouse. The data is taken from the early stages of development when the blastocyst is forming. At the 32 cell stage the cells have already separated into the trophectoderm (TE), which goes on to form the placenta, and the inner cell mass (ICM). The ICM further differentiates into the epiblast (EPI)---which gives rise to the endoderm, mesoderm and ectoderm---and the primitive endoderm (PE), which develops into the amniotic sac. Guo et al selected 48 genes for expression measurement. They labelled the resulting cells, and their labels are included as an aid to visualization.

# In the paper they first visualized their data using principal component analysis, and the first two principal components fail to separate the different developmental stages. This is perhaps because those components are dominated by the variation among the 64 cell stage cells: there are more cells from that stage in the data set, and the natural variation across the 64 cell stage may also be greater. Both probably contribute to the dominance of these cells in the first two principal components. The first thing we do is plot the principal coordinates of the cells along the first two dominant directions.

# ### This Notebook

# In this notebook we will perform PCA on the original data, showing that the different regimes do not separate. Then we follow Buettner and Theis (2012) in applying the GP-LVM to the data. There is a slight pathology in the result, one which they fixed by using priors that were dependent on the developmental stage. We then show how the Bayesian GP-LVM doesn't exhibit those pathologies and gives nice results that seem to show the lineage of the cells.

# This notebook is an extension of [SingleCellDataWithGPy](SingleCellDataWithGPy.ipynb) for tutorial needs.

# In[2]:


import pods, GPy, itertools
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib import pyplot as plt


# Next we load in the data. We've provided a convenience function for loading it with the `pods` library. It is loaded as a `pandas` DataFrame, which allows us to summarize it with the `describe` method.

# In[3]:


s = pods.datasets.singlecell()
Ydf = s['Y']
Y = Ydf.values
labels = s['labels']
marker = '<>^vsd'
Ydf.describe()


# - Try exploring the data in a little more detail.
# - How many genes were measured, and how are their expression values distributed?
# - Try plotting some histograms of the data, so you can see how it is distributed.
# - Can you plot a heatmap (`plt.imshow()`)? One possible exploration is sketched after the exercise cell below.

# In[4]:


# Type your code here
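# One possible exploration is sketched below. It is a minimal sketch, not the unique solution, and only assumes the `Ydf`, `Y` and `plt` names defined above.

# In[ ]:


# Quick look at the shape of the data: rows are cells, columns are the 48 genes.
print(Ydf.shape)

# Histogram of all expression values pooled together.
plt.figure(figsize=(6, 3))
plt.hist(Y.flatten(), bins=50)
plt.xlabel('expression value')
plt.ylabel('count')

# Heatmap of cells (rows) against genes (columns).
plt.figure(figsize=(6, 6))
plt.imshow(Y, aspect='auto', interpolation='none')
plt.xlabel('gene index')
plt.ylabel('cell index')
_ = plt.colorbar()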
# Here we define a little helper routine which puts the legend to the right of the plot, so that the legend does not obscure the data, along with a routine for plotting a two-dimensional latent space with the cell-stage labels.

# In[4]:


def legend_to_the_right(ax):
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width, box.height])
    _ = ax.legend(loc='center left', bbox_to_anchor=(1, 0.5), numpoints=1)

def plot_latent(ax, x, y, marker, labels):
    GPy.plotting.matplot_dep.Tango.reset()
    # make sure labels are in order of input:
    ulabels = []
    for lab in labels:
        if lab not in ulabels:
            ulabels.append(lab)
    # one colour and one marker symbol per label, cycling through the symbols:
    for label, symbol in zip(ulabels, itertools.cycle(marker)):
        ind = labels == label
        ax.plot(x[ind], y[ind], marker=symbol,
                markerfacecolor=GPy.plotting.matplot_dep.Tango.nextMedium(),
                linestyle='', label=label, mew=.2, alpha=1)


# ### Principal Component Analysis

# Now we follow Guo et al in performing PCA on the data. Rather than calling a 'PCA routine', here we break the algorithm down into its steps: compute the data covariance, compute its eigenvalues and eigenvectors, and sort them according to the magnitude of the eigenvalues. Because we want to visualize the data, we've chosen to compute the eigenvectors of the *inner product matrix* rather than the covariance matrix. This allows us to read the principal coordinates of the cells directly off the eigenvectors. However, it is less efficient (in this case, because the number of genes is smaller than the number of cells) than computing the eigendecomposition of the covariance matrix.

# Now, using the helper function we can plot the results with appropriate labels.
# 
# - Use the `GPy.util.pca.PCA` class to run PCA on the dataset (`Y`).
# - Plot the result of the PCA using either the `plot_latent` function or `PCA.plot_2d()`.
# - Plot the eigenvalue fractions for the first 10 dimensions, using `plt.bar` or `PCA.plot_frac(Q=10)`.

# In[7]:


get_ipython().run_line_magic('pinfo', 'GPy.util.pca.PCA')
# Type your code here


# ### GP-LVM on the Data

# Buettner and Theis applied a Gaussian process latent variable model to the data, but with a modified prior to ensure that small differences between cells at the same developmental stage were preserved. Here we apply a standard GP-LVM (no modified prior) to the data.
# 
# - Running a GP-LVM on the given data is as easy as creating an instance of `GPy.models.GPLVM`.
# - Beware of long running times if the number of latent dimensions is too large.
# - Optimize the model and plot the latent space (using `m.plot_latent`, or the `plot_latent` function defined above).
# - Try different combinations of kernels; allowed kernels are:
#     * `GPy.kern.Linear`
#     * `GPy.kern.White`
#     * `GPy.kern.Bias`
#     * `GPy.kern.RBF`
# - Try out different kernels; we have found the best is to use `GPy.kern.RBF(Q, ARD=1) + GPy.kern.Bias(Q)`, where `Q` is the number of latent dimensions.

# In[8]:


# Type your code here


# In[6]:


# A result could look like this:
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(9, 4))
m.kern.plot_ARD(ax=ax1)
m.plot_latent(labels=labels, marker=marker, legend=False, ax=ax2)
_ = legend_to_the_right(ax2)


# As a small bonus, you can plot the magnification factor, using
# ```
# m.plot_magnification(labels=labels, marker=marker, legend=True)
# ```
# here.
# 
# - Can you see the difference between the shading of `plot_latent` and `plot_magnification`?

# In[9]:


# Type your code here


# ### Bayesian GP-LVM

# Finally we show the new code that uses the Bayesian GP-LVM to fit the data. This means we can automatically determine the dimensionality of the model whilst fitting a non-linear dimensionality reduction. The approximations used also mean that it is faster than the original GP-LVM.
# 
# - You can find the Bayesian GP-LVM in `GPy.models.BayesianGPLVM`.
# - Fit a Bayesian GP-LVM to the data.
# - Try different combinations of kernels; allowed kernels are:
#     * `GPy.kern.Linear`
#     * `GPy.kern.White`
#     * `GPy.kern.Bias`
#     * `GPy.kern.RBF`
# - Plot the latent space and the magnification factor and compare them to the GP-LVM (one possible fit is sketched after the exercise cell below).

# In[10]:


# Type your code here
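# One possible fit is sketched below. This is a minimal sketch rather than the definitive solution: the number of latent dimensions (`Q = 5`), the kernel choice, the number of inducing points and the optimization settings are all assumptions you may want to vary.

# In[ ]:


# Sketch: Bayesian GP-LVM with an ARD RBF kernel plus a bias term.
Q = 5  # assumed number of latent dimensions
kern = GPy.kern.RBF(Q, ARD=True) + GPy.kern.Bias(Q)
m = GPy.models.BayesianGPLVM(Y, Q, kernel=kern, num_inducing=30)
m.optimize(messages=True, max_iters=1000)  # this can take a few minutes

# ARD weights alongside the latent space, coloured by cell-stage labels.
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(9, 4))
m.kern.plot_ARD(ax=ax1)
m.plot_latent(labels=labels, marker=marker, legend=False, ax=ax2)
_ = legend_to_the_right(ax2)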
# This gives a really nice result. Broadly speaking, two latent dimensions dominate the representation, and when we visualize the cells in those two dimensions the entire cell lineage is laid out nicely. We can also see that some cells were misclassified by the 'standard' approach of repeated k-means clustering and PCA on the sub-clusters (this approach was used to obtain the sample colors for the 64 cell stage).

# ### Other Nonlinear Approaches

# How do other non-linear approaches compare with the GP-LVM? We can use scikit-learn to try a few. First we consider Isomap, which often performs really well on data.
# 
# - We have provided the full workflow for these approaches for you to look at and compare against. If you like, feel free to try to improve or change how the other nonlinear approaches are used.
# - Try out your own method (an optional sketch using t-SNE is given at the end of the notebook).

# ### Isomap

# Isomap first builds a neighborhood graph, and then uses distances along this graph to approximate the geodesic distance between points. These distances are then visualized by performing classical multidimensional scaling (which involves computing the eigendecomposition of the centered distance matrix). As the neighborhood size is increased to include all of the data, principal component analysis is recovered (or, strictly speaking, principal coordinate analysis). The fewer the neighbors, the more 'non-linear' the Isomap embeddings are.

# In[10]:


n_neighbors = 30
import sklearn.manifold
m = sklearn.manifold.Isomap(n_neighbors=n_neighbors, n_components=2)
X = m.fit_transform(Ydf)
fig, ax = plt.subplots(1)
plot_latent(ax, X[:, 0], X[:, 1], marker, labels)
_ = legend_to_the_right(ax)


# ### Locally Linear Embedding

# Next we try locally linear embedding. In locally linear embedding a neighborhood is also computed, and each point is then reconstructed as a linear weighted combination of its neighbors. This implies that a locally linear patch is being fitted to the data in that region. These patches are assimilated into a large $n\times n$ matrix, and a lower dimensional data set which reflects the same relationships is then sought. Quite a large number of neighbors needs to be selected for the data not to collapse in on itself. When a large number of neighbors is selected the embedding is more linear and begins to look like PCA. However, the algorithm does *not* converge to PCA in the limit as the number of neighbors approaches $n$.

# In[11]:


n_neighbors = 200
m = sklearn.manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2)
X = m.fit_transform(Y)
fig, ax = plt.subplots(1)
plot_latent(ax, X[:, 0], X[:, 1], marker, labels)
_ = legend_to_the_right(ax)


# #### Work funded by the BioPreDyn and MLPM projects; it is a collaboration with Nicolas Durrande and Johannes Jaeger.
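# As an optional extra for the 'try out your own method' suggestion above, here is a minimal sketch using t-SNE from scikit-learn; the perplexity and random seed are assumptions worth varying, and any other `sklearn.manifold` method could be substituted.

# In[ ]:


# Sketch: t-SNE embedding of the expression data, plotted with the same helpers.
tsne = sklearn.manifold.TSNE(n_components=2, perplexity=30, random_state=0)
X = tsne.fit_transform(Y)
fig, ax = plt.subplots(1)
plot_latent(ax, X[:, 0], X[:, 1], marker, labels)
_ = legend_to_the_right(ax)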