Selecting the right tool for the job: a comparison of visualization algorithms

Daniel Burkhardt*, Scott Gigante*, Smita Krishnaswamy

Yale University, New Haven, CT.

*Equal contribution

How to read this notebook

If you are viewing this on Binder go to the toolbar above and click Cell->Run all to load the interactive plots. You can also press Shift-Return to run a single cell.

Parts of this notebook require an active Python server, which is why we recommend it be viewed in Binder. You can also run it on your own local Jupyter server. If you have any problems, please post an issue on GitHub.

In [1]:
import warnings
warnings.simplefilter("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
import scprep
import os
from blog_tools import data, embed, interact
%matplotlib inline

Introduction

What is good visualization?

Exploring high-dimensional data typically involves some form of dimensionality reduction to two or three dimensions for visualization. The goal of visualization is to gain insight into the structure of the data, specifically the relationships between points and structures formed in the data. Visualization is an important tool for hypothesis generation, for narrowing the scope of investigation, and for aiding in the selection of tools for future analysis. The importance of this problem is reflected in the many visualization tools that have been developed for high-dimensional data. To name a few, we have: Principal Components Analysis (PCA), Multidimensional Scaling (MDS), Diffusion Maps, Isometric Mapping (ISOMAP), Laplacian Eigenmaps, Locally Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE). However, along with a diversity of tools comes a diversity of associated biases and assumptions that these methods introduce, and the dreaded problem of selecting the right tool for a given dataset.

Take, for example, t-SNE. At the time of writing, Visualizing Data using t-SNE (2008) has over 8,400 citations. In some fields, like computational biology, it is difficult to find a published paper without a t-SNE visualization. However, t-SNE has a unique set of biases that make it prone to misinterpretation. These limitations are explored in the Distill article How to Use t-SNE Effectively. In that article, the authors show how parameter selection and intrinsic biases in the t-SNE algorithm can lead to misleading visualizations, and the article has become a widely referenced teaching tool for the rational application of the method.

Here, we seek to take the question of How to Use t-SNE Effectively one step further: How can you select among the many visualization tools effectively? How can you judge the results of each method? What benchmarks should one employ when considering different methods? Which methods are best suited for different data modalities? The remainder of this article is divided into four sections.

Structure of the article

1. Selecting toy datasets for comparing methods

First, we will talk about different kinds of structures in data and introduce a few toy datasets. We will use these datasets to discuss the algorithms behind six popular dimensionality reduction methods used for visualization: PCA, MDS, ISOMAP, t-SNE, UMAP, and PHATE.

2. Parameter selection

Next, we will discuss the effect of parameter selection on the output of each algorithm, which can often be a crucial step in producing a good visualization.

3. Real world data

Then, we will apply our six comparison methods to large, real-world datasets. The emphasis of this section will be on how a particular visualization tool biases the observer towards a set of conclusions about a given dataset.

4. Quantitative evaluation of visualizations

Finally, we will examine how to quantitatively evaluate the quality of a dimensionality reduction for the purposes of visualization through the use of ground truth data and labels. This will allow us to make judgements about the qualities of each method in a more robust manner.

Dimensionality reduction vs. visualization

Before we go further, it will be useful to clarify the distinction between the goal of dimensionality reduction algorithms in general and the goal of dimensionality reduction used for data visualization in particular. The important distinction here is that some dimensionality reduction algorithms are useful for reducing the computational complexity of machine learning tasks, but are not good for creating interpretable two- or three-dimensional visualizations of a dataset. Take, for example, dimensionality reduction by random orthogonal projection, often used in approximate nearest-neighbors search. This method takes a set of $n$ random orthogonal vectors in the original $d$-dimensional data space with $n ≪ d$ and projects the data onto those random dimensions. This approach preserves most of the structure of the data so that statistics (such as pairwise distances) can be calculated more quickly. However, random projections evenly preserve information across all $n$ dimensions and will not produce features useful for visualizing the relationships between data points. In contrast, visualization tools prioritize information preservation in the first 2 or 3 dimensions so that a human can intuitively explore the data and start to generate hypotheses about the relationships between data points.
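
To make this concrete, here is a minimal sketch of random orthogonal projection; the data, dimensions, and constants are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
n_obs, d, n = 500, 1000, 50               # n << d
X = rng.normal(size=(n_obs, d))

# Orthogonalize a random d x n matrix to obtain n random orthogonal directions
Q, _ = np.linalg.qr(rng.normal(size=(d, n)))
X_proj = X @ Q

# Pairwise distances are preserved up to a global scale factor of ~sqrt(n/d),
# so their correlation with the original distances is close to 1
print(np.corrcoef(pdist(X), pdist(X_proj))[0, 1])
```

None of these $n$ projected dimensions is any more informative than another, which is exactly why such an embedding makes a poor visualization.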

Selecting toy datasets for comparing methods

Rigorous application of any computational technique starts with considering the desired application and how well a given method's biases fit that application. Generally, a good place to start is by creating toy data. A good toy dataset is small, easy to understand intuitively, and has a clear heuristic for a successful visualization. In How to Use t-SNE Effectively, the authors present several compelling toy datasets. Here, we will consider a few of our own to start with.

Ground truth images of six datasets

We selected these datasets because of their varying structures and modalities.

  • Swiss roll: This dataset is adapted from sklearn's datasets.make_swiss_roll() function. The data exists in three dimensions with one latent dimension: the distance of each point from the center. This is represented by the color of the points in the above plots. The first two dimensions are generated using sines and cosines of this latent dimension and the third dimension is generated as uniform random noise. In this dataset, the noise dimension is slightly larger than the two shown above. As we will see, this proves problematic for some algorithms. The "ideal" visualization of the swiss roll is a single line that represents the latent dimension of the dataset (see the sketch after this list).
  • Three blobs: The three blobs here are Gaussian clouds that were first generated in just two dimensions and then linearly projected into 200 dimensions using a random Gaussian projection. In the original space, the distance between the orange and blue blobs is 1/5th the distance between the orange and green blobs. The goal here is to preserve the relative distances between the blobs while rotating the data back into two dimensions.
  • Uneven circle: Here, the data points are distributed along a circle in 20 dimensions, but the spacing between the points is irregular. The color of each point represents its angular position.
  • Tree: For our final dataset, we look at a collection of points arranged in a tree, generated using the tool Splatter. This data emulates smooth transitions between various states of biological systems. All of the branches of the tree are of equal length, but the points are not spaced evenly along the branches. Furthermore, some of the features change non-linearly along branches. The goal here is to recreate the topology of the six branches shown above.
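
As promised, here is a minimal sketch of how a Swiss roll like the one above might be generated. The actual data comes from blog_tools' data.swissroll(), so the exact constants here (including how much larger the noise dimension is) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = 1.5 * np.pi * (1 + 2 * rng.uniform(size=n))   # latent "unrolled" coordinate

# First two dimensions: the roll itself; third: uniform random noise
x, y = t * np.cos(t), t * np.sin(t)
width = 1.2 * (x.max() - x.min())                 # noise slightly larger (assumed)
z = rng.uniform(-width / 2, width / 2, size=n)

X = np.column_stack([x, y, z])
# The "ideal" 1D visualization recovers t itself: a single line
```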

With these toy datasets in hand, we can begin to compare methods.

Introducing... the algorithms!

For this blog, we wanted to focus on a mix of classic and popular algorithms. Not many people use PCA for rigorous visualization of high-dimensional data (looking at you, population geneticists), but it is a common tool for preliminary analysis. Similarly, MDS, which is actually a collection of methods, is not as frequently applied for data analysis but serves as a foundation for non-linear dimensionality reduction. On the other hand, PHATE is a relatively new method for visualizing high-dimensional data that we chose to include both because we developed it and because it does a good job of preserving continuous structures within a dataset.

In this section, we will introduce the algorithms one by one, give a brief overview of how each algorithm works, then show the results of that algorithm on our test cases.

Run the code below to display the discussion.

In [2]:
algorithms = embed.__all__
tab_widget = interact.TabWidget()
for algorithm in algorithms:
    name = algorithm.__name__
    with tab_widget.new_tab(name):
        interact.display_markdown("md/discussion-{}.md".format(name),
                                  placeholder='Introduction to {}'.format(name))

tab_widget.display()

Parameter selection

When running a visualization tool, the user is inevitably faced with the daunting choice of various parameters. Some parameters are robust and barely affect the result of the visualization; others drastically affect the results and need to be tuned with care. Here we use the Swiss Roll dataset presented above to give an introduction to the influence that parameter tuning has on a visualization.

In order to keep things relatively simple, we have selected a maximum of two parameters for each algorithm: the random seed, and the parameter which most strongly affects the result. Some algorithms have many more parameters, and it is worthwhile investigating these in more detail for the visualization algorithm of your choice. You can see a more detailed widget with extra parameters for t-SNE, UMAP and PHATE here.

On the left, you see a 3D projection of the Swiss Roll. Move the sliders below the plot to adjust the parameters for the visualization on the right. If the points don't appear on your screen, try moving your mouse over the plot or double clicking on the visualization.

In [3]:
import pickle
results = pickle.load(open("data/parameter_search.pickle", 'rb'))
dataset = data.swissroll()
tab_widget = interact.TabWidget()
for algorithm in results.keys():
    with tab_widget.new_tab(algorithm):
        interact.parameter_plot(algorithm, results, dataset.X_true, c=dataset.c)

tab_widget.display()

Feel free to explore this visualization and draw your own conclusions. Here are a few important points:

Random seed

Ideally, if you visualize a dataset with the same parameters, your visualization should be the same every time. Unfortunately, since many methods use randomization for speed improvements, the random seed (initialization for Python's pseudo-random number generator) can impact the result. All of the methods shown here accept a random seed as a parameter, but as you can see, it has a highly variable effect.

Firstly, PCA, ISOMAP and PHATE show no noticeable change when you adjust the random seed slider. We call this property robustness, and it's a useful trait to find in a visualization algorithm because it increases reproducibility. MDS shrinks and warps ever so slightly, but the overall structure of the visualization does not change. On the other hand, UMAP and t-SNE significantly change the layout of the visualization and the relationships between groups, and some points even totally change position on the plot, depending on the choice of other parameters.
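
If you want to check seed robustness yourself, a quick sketch (using scikit-learn's t-SNE on stand-in data, with a Procrustes alignment to factor out rotation and scaling) might look like this:

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import TSNE

X = np.random.RandomState(0).normal(size=(200, 10))   # stand-in data

# Two runs that differ only in their random seed
emb1 = TSNE(n_components=2, random_state=0).fit_transform(X)
emb2 = TSNE(n_components=2, random_state=1).fit_transform(X)

# Disparity after optimal translation/rotation/scaling; 0 means identical shape
_, _, disparity = procrustes(emb1, emb2)
print(disparity)
```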

Neighborhood size

Now consider the other parameters we have shown: knn (also known as n_neighbors, used in ISOMAP, UMAP and PHATE) and perplexity (used in t-SNE). All of these parameters modify the connectivity of the graph underlying the visualization.
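
As a rough sketch of where these parameters live (assuming the scikit-learn, umap-learn, and phate packages; the values shown are simply the usual defaults):

```python
import numpy as np
from sklearn.manifold import Isomap, TSNE
import umap    # pip install umap-learn
import phate   # pip install phate

X = np.random.RandomState(0).normal(size=(300, 10))   # stand-in data

# Each method exposes its neighborhood size under a different name
embeddings = {
    "ISOMAP": Isomap(n_neighbors=5).fit_transform(X),
    "t-SNE": TSNE(perplexity=30).fit_transform(X),
    "UMAP": umap.UMAP(n_neighbors=15).fit_transform(X),
    "PHATE": phate.PHATE(knn=5).fit_transform(X),
}
```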

If connectivity is too low, the graph starts to become disconnected. This is very obvious in t-SNE and UMAP, where the plot becomes an apparently random point cloud with knn/perplexity set to 2. A perhaps more interesting case can be observed in ISOMAP in the transition between knn = 5, which gives a perfectly reasonable embedding of the Swiss Roll as a plane, and knn = 3, when the top half of the plane becomes disconnected, leaving the rest to warp and distort in unexpected (and undesired) ways.

On the other hand, if connectivity is too high, we can start to incorporate long-range Euclidean distances, which tend not to be so meaningful. In ISOMAP, t-SNE and PHATE in particular, we observe that as connectivity becomes extremely high, the start and end of the Swiss Roll become connected and overlap on the plot. This is once again undesirable, as these parts of the Swiss Roll were disconnected in the original data, but thanks to our extremely large neighborhoods we have connected them, giving a misleading visualization.

Other parameters

Most of the methods here have many more parameters than what we have shown. For example, if you look at the scikit-learn documentation for t-SNE, you will see 13 parameters that can be set on the estimator. It is worth investigating which of these influence the embedding, and understanding why you get the visualization you do with the parameters you have chosen.
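
For example, scikit-learn estimators expose every tunable parameter through get_params(), which makes it easy to see what there is to explore:

```python
from sklearn.manifold import TSNE

# List every parameter of t-SNE and its default value
for param, value in sorted(TSNE().get_params().items()):
    print(f"{param} = {value}")
```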

Real world data

The role of real data in comparing visualization algorithms

Toy data is useful because it provides insights into the kinds of idiosyncracies that a given algorithm is sensitive to. However, it is also important that a dimensionality reduction algorithm performs well on real world data. Although you lose access to ground truth, it is still possible to compare algorithms.

Generally speaking, it is useful to first consider how well existing knowledge is preserved by a given embedding. Are groups of data points that are known to be closely related displayed proximally in the visualization? Can longer-range relationships be observed across the plot?

Once these properties have been confirmed, we will usually generate some hypotheses about the data and further examine trends within it. This might mean gathering additional information by integrating an outside dataset or performing another experiment. Although this process is challenging, the real utility of a visualization lies in how successfully one can generate hypotheses from it and test them using an independent method.

Here, we will describe three real world datasets and compare each visualization algorithm on each dataset. Again, we're going to use default parameters for every method.

Analysis of image data

The Frey faces dataset

This dataset consists of nearly 2000 images of Brendan Frey's face taken as sequential frames from a video. In this dataset, Frey is making a series of facial expressions and many intermediate frames are available between expressions. Hence, we should be able to identify some continuous progressions in facial expression. For example, we should be able to see a smooth transition between a neutral face and a smile or frown.

Hover over the points to see the associated image. Points are colored by frame number: points from the start of the video are in black and points from the end are in yellow. Click on a point to highlight it on all six plots. If the plot does not display properly, try running the cell again by clicking on the code and hitting Shift+Return.

In [4]:
interact.image_plot(data.frey(), algorithms)
Calculating PCA...
Calculated PCA in 0.07 seconds.
Calculating MDS...
Calculated MDS in 0.79 seconds.
Calculating ISOMAP...
Calculated ISOMAP in 0.07 seconds.
Calculating TSNE...
Calculated TSNE in 0.79 seconds.
Calculating UMAP...
Calculated UMAP in 3.07 seconds.
Calculating PHATE...

This dataset is completely unlabeled, so coming up with a good measure of embedding quality is difficult. We suggest exploring the faces in each embedding and seeing what conclusions you can draw. We'll provide our impressions in the next paragraph, but exploring on your own is a good exercise. Moving the mouse around, it is possible to create little movies of Frey's face.

Comparing the six embeddings, we see two trends. First, PCA and MDS produce a single blob with no well-defined structure. In contrast, ISOMAP, UMAP, and PHATE produce visualizations with two distinct collections / blobs of points. Generally speaking, one of the blobs contains images where Frey is frowning, and the other shows Frey with a smile. t-SNE performs similarly here, but we see that in some embeddings (try re-running the code block) the collection of frowns is broken into two groups where we can see some frowns juxtaposed with smiles. Otherwise, most methods do a good job of embedding visually "similar" images next to each other.

One further observation can be made regarding transitions between the two major groups. In UMAP, the smiling images and frowning images are entirely separate with no obvious path between them. On the other hand, ISOMAP and PHATE show two or three bridges between the two clusters, which provide a path from a smile to a grimace to a frown. Such smooth transitions between clusters are impossible to observe in t-SNE and UMAP, as these methods explicitly prevent such occurrences in their optimization.

Analysis of population genetics data

The 1000 Genomes Project

The 1000 Genomes Project is an international effort to understand genetic variation across human populations. Originating in 2007, the project was completed in 2013, describing variation in roughly 3000 individuals from 26 populations across more than 80,000 genetic loci. The project is a high-quality resource for understanding differences between people from different regions across the globe.

This data is useful for visualization because we have an intuitive understanding of geographic relationships between people, but the data is so high-dimensional that understanding whether these structures are preserved in the genetic space is otherwise infeasible.

In the 1000 Genomes data we also have access to detailed population information about each individual in the study. We expect that geographically proximal populations are genotypically similar and should therefore be grouped together in the visualization.

What does the data look like?

As mentioned above, the 1000 Genomes dataset contains genetic information about 3000 individuals. The data we will embed is called a genotype matrix. The rows of this matrix correspond to the individuals in the study, and each column corresponds to a Single Nucleotide Polymorphism (SNP). Each entry in the matrix is 0 if the individual is homozygous for the reference allele, 1 if the individual is heterozygous for the reference and alternative alleles, and 2 if the individual is homozygous for the alternative allele. If these terms are unfamiliar to you, we suggest reading the NIH primer on SNPs. All that's really important to know is that each entry in the matrix indicates a specific DNA element of each individual in the dataset.
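
For illustration, a hypothetical toy genotype matrix might look like the following (the allele frequencies here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows: individuals; columns: SNPs. Entries count copies of the alternative
# allele: 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternative
genotypes = rng.choice([0, 1, 2], p=[0.7, 0.2, 0.1], size=(6, 8))
print(genotypes)
```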

Thankfully, Alex Diaz-Papkovich from Simon Gravel's lab at McGill provides scripts to load and preprocess this data on GitHub. All we need to do is run his scripts and pass the output through the algorithms for our comparison.

Comparing visualization algorithms on the 1000 Genomes data.

1000 Genomes comparison

This data is complex with many different populations spanning six continents. Generally speaking, blue corresponds to East Asian, green is South Asian, yellow/orange is African, red represents European, and pink is Ad Mixed American. The exact breakdown of each population code is given by the following table.

Population Code | Population Description | Super Population
CHB | Han Chinese in Beijing, China | East Asian
JPT | Japanese in Tokyo, Japan | East Asian
CHS | Southern Han Chinese | East Asian
CDX | Chinese Dai in Xishuangbanna, China | East Asian
KHV | Kinh in Ho Chi Minh City, Vietnam | East Asian
CEU | Utah Residents (CEPH) with Northern and Western European Ancestry | European
TSI | Toscani in Italia | European
FIN | Finnish in Finland | European
GBR | British in England and Scotland | European
IBS | Iberian Population in Spain | European
YRI | Yoruba in Ibadan, Nigeria | African
LWK | Luhya in Webuye, Kenya | African
GWD | Gambian in Western Divisions in the Gambia | African
MSL | Mende in Sierra Leone | African
ESN | Esan in Nigeria | African
ASW | Americans of African Ancestry in SW USA | African
ACB | African Caribbeans in Barbados | African
MXL | Mexican Ancestry from Los Angeles USA | Ad Mixed American
PUR | Puerto Ricans from Puerto Rico | Ad Mixed American
CLM | Colombians from Medellin, Colombia | Ad Mixed American
PEL | Peruvians from Lima, Peru | Ad Mixed American
GIH | Gujarati Indian from Houston, Texas | South Asian
PJL | Punjabi from Lahore, Pakistan | South Asian
BEB | Bengali from Bangladesh | South Asian
STU | Sri Lankan Tamil from the UK | South Asian
ITU | Indian Telugu from the UK | South Asian

What do we see?

PCA is the most common method for visualizing population genetics information, but limitations in this method mean that as larger datasets become available, researchers are often searching for other tools (Diaz-Papkovich et al. 2019). Examining the output of PCA on this dataset, we can see clear separation between the African populations, denoted by the yellow/orange points, and the rest of the populations. Similarly, the East Asian population is separated from the rest. However, it is difficult to make statements about the American, South Asian, or European populations.

MDS has some artistic appeal here, as the visualization resembles a globe. However, we have a problem: the pink Ad Mixed American population is divided in two by the South Asian population. We see this as well with PCA, but it does not reconcile with our understanding of genetic variation between these populations. Generally speaking, humanity's migration out of Africa has been characterized by increasing genetic isolation with limited examples of population mixing (Hellenthal et al. 2014; Norris et al. 2018).

ISOMAP performs poorly here. See how the South Asian and East Asian populations are completely reduced to a very small region of the plot. This poorly represents the diversity of those populations. Also, ISOMAP creates ridge-like artifacts in the embedding which don't occur in any other embedding, but are a common occurrence in ISOMAP.

t-SNE can be very sensitive to outliers. Here, many individuals have essentially no neighbors in the high-dimensional space, and therefore they get evenly distributed around the visualization. Not so useful. Also, the size of each cluster in t-SNE reflects the number of points, not the amount of variance: we know (from both the PCA plot and prior knowledge; see Campbell and Tishkoff (2008)) that the genetic diversity in Africa is significantly greater than on all other continents, so we would expect the African population to take up more space on the plot, indicating its large amount of heterogeneity.

UMAP was shown to work well for population genetics data by Diaz-Papkovich et al. (2019), and this was actually the inspiration for this real-world data application. There, the authors used some parameter tuning to generate the plots in the manuscript, so their visualization is slightly nicer than what we see here. However, the rule of this blog post is to use default parameters in all comparisons. We see here that the defaults for UMAP produce a visualization that is very compressed, making it hard to see fine-grained relationships between points.

PHATE does a good job here of presenting each population as a separate group of points on the plot. The default parameters also compress a lot of the variation into a few smooth trajectories. One of the places where PHATE shines is in preserving the ordering of each principal component within each group of points.

Examining PCs along each visualization

Use the dropdown list to select a principal component by which to color the visualizations. Look in particular at PC1, PC2 and PC3.

In [5]:
interact.color_plot(path="img/1000_genomes.comparison.{}.png", name='Component', 
                    options=['PC{}'.format(i) for i in range(1,11)])

In the above plots, we're looking at the first three principal component loadings of each individual in the dataset. If you look at PC1 in PHATE vs UMAP, PHATE preserves the ordering of individuals by increasing PC1 loadings, whereas UMAP splits the data points with high PC1 coordinates across two different clusters. Examining PC2 and PC3, we see that the embeddings produced by MDS and t-SNE order points non-monotonically by their coordinate loadings. This is a little frustrating because we generally think of these coordinates as representing axes of diversity throughout each population.

Analysis of single cell RNA-sequencing data

What is single cell RNA-sequencing?

Every so often a biological discipline encounters a new technology that makes previously unthinkably complicated experiments suddenly accessible. In the field of genomics, single-cell RNA-sequencing (scRNA-seq) is the latest technology to change how biologists approach fundamental questions like "What makes a neuron different from a heart cell or a skin cell?" If you've never heard of this technology before, all you need to know is that it makes it possible to measure which genes are activated in tens of thousands of individual cells. The output of a scRNA-seq experiment is referred to as a gene expression matrix. Here, the rows represent individual cells and each column is one of the 20K-30K genes in the genome of human, mouse, or another experimental system. Each entry in this matrix represents the number of molecules of each gene that were detected in that cell. There's a lot of nuance and detail that we've glossed over here, but this expression counts matrix is the primary focus in scRNA-seq analysis. A more thorough discussion can be found in Yuan et al. 2017 and Luecken and Theis, 2019.

Because scRNA-seq produces measurements over tens of thousands of dimensions, it certainly qualifies as high-dimensional. scRNA-seq is also noisy. Single cells contain a very small amount of RNA (think picograms), and current chemistries for capturing that RNA are inefficient. Of the roughly 100K-500K RNA molecules in each cell, only 5-20% are captured by current technologies. This means a lot of lowly expressed genes, which are often important for determining cell state, are counted as $0$ in the gene counts matrix despite being expressed in a given cell.
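
To get a feel for how this inefficiency produces zeros, here is a toy simulation; the Poisson count model and the 10% capture rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_counts = rng.poisson(lam=2, size=10000)   # hypothetical true molecule counts

# Simulate inefficient capture: each molecule is observed with ~10% probability
observed = rng.binomial(true_counts, 0.1)

expressed = true_counts > 0
print("fraction of expressed genes observed as 0:",
      (observed[expressed] == 0).mean().round(2))
```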

Currently, t-SNE is the most popular method for visualizing single cell data. We recently attended the Single Cell Genomics 2018 conference in Boston and it was difficult to find a poster or talk that didn't use t-SNE. However, as shown above, t-SNE does not do a good job of representing continuous structures within a dataset and has a tendency to shatter connected data points. We specifically developed PHATE as a solution to this problem. To show why it is important to preserve such structure in a large dataset, we are going to focus on a time course of human stem cell development.

Embryoid Body Time course

While developing PHATE, we partnered with Natalia Ivanova's laboratory at Yale University to create a detailed time course study of stem cell development in vitro. The experiment used human embryonic stem cells grown in 3-dimensional culture. The resulting cell cultures are called embryoid bodies or EBs because they recapitulate the spatial component of embryological development. This allows the EB culture to differentiate into lineages similar to those of a real embryo, though the macroscopic structure is not similar to a real embryo.

The dataset comprises 16,820 cells measured over 27 days of EB culture. After filtering, we find that 17,445 genes are expressed at some point during the time course. Each visualization was generated using default parameters for each algorithm run on the square root transformed and normalized data. Because t-SNE and UMAP fail completely when run on the raw data, we run all algorithms on the first 100 principal components. Although we did not impute values before visualization, to examine expressed markers on these embeddings, we imputed missing gene expression values using MAGIC. MAGIC is a method developed in our lab for manifold-based denoising single cell gene expression (van Dijk et al. 2018).

All preprocessing steps can be found in this tutorial.
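
For reference, a condensed sketch of that preprocessing (assuming scprep's normalization and transform helpers, with a stand-in counts matrix in place of the real data) looks roughly like this:

```python
import numpy as np
import scprep
from sklearn.decomposition import PCA

# Stand-in counts matrix; in practice, load the EB data as in the tutorial
counts = np.random.default_rng(0).poisson(1.0, size=(500, 2000)).astype(float)

counts = scprep.normalize.library_size_normalize(counts)
counts = scprep.transform.sqrt(counts)
pcs = PCA(n_components=100).fit_transform(counts)   # input to each algorithm
```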

Visualizations of the EB dataset

EB comparison

PCA is designed to identify linear combinations of genes that explain maximal variance in a dataset. Here we see that PCA does a good job of identifying the largest axis of variation, which is the time point of collection. However, PCA is limited to identifying linear combinations of features, and variation is calculated globally; hence, the fine-grained local structure of the data is lost.

MDS performs no better than PCA on this dataset. This is not entirely surprising because both MDS and PCA preserve global linear relationships between all data points.

ISOMAP performs slightly better than PCA and MDS because we can see some more structure in the intermediate 12-15 and 18-21 time points while preserving the general ordering of the data. In some regions of the data we do see overlapping cells from different time points, but this is not entirely inconsistent with the biology of the system. To better judge this embedding, we will need to examine expression of marker genes in the next section.

t-SNE is able to separate some of the branching structures in the data, but separates the branches into clusters such that the connections between them are not clear. On the right side of the plot, there are two groups of cells that are placed right next to the cells from the earliest time point. In the top of the plot, we observe a cluster in which cells from the final time point are surrounded on all sides by cells from an earlier stage. This does not map neatly onto our understanding of stem-cell development.

UMAP does even worse than t-SNE here. As we saw with the simulated tree data, UMAP shatters groups of cells and places them at arbitrary distances around the plot. It is possible that both algorithms would perform better with some parameter tuning, but one of the goals of this analysis was to compare each method using default parameters.

PHATE both preserves the global ordering of cells from each sample and preserves fine-scale distinctions between subpopulations of cells. As we will see in the next section, this structure is supported by the expression of known marker genes. In the interests of full disclosure, the default parameters for PHATE were decided upon with a few datasets in mind, including this one.

Examining marker genes

In biology, distinct cell states (e.g. neurons, cardiomyocytes, T cells, etc.) are often identified through expression of marker genes that are present in one population of cells and not in others. In reality, there is usually a combination of marker genes that denotes a specific cell type. As high-dimensional transcriptomic phenotyping (i.e. identifying cells using scRNA-seq) becomes more common, even what defines a cell type is coming under debate.

To identify a meaningful set of cells in the EB time course, our collaborator, Dr. Natalia Ivanova, examined a set of previously described markers known to play a role in stem cell development. She produced the following plot:

EB lineage map

This lineage tree of the EB system shows embryonic stem cells (ESC), the primitive streak (PS), mesoderm (ME), endoderm (EN), neuroectoderm (NE), neural crest (NC), neural progenitors (NP), and others; red font indicates novel cell precursors. The PHATE embedding is overlaid with each of the populations in the lineage tree. Other abbreviations include lateral plate ME (LP ME), hemangioblast (H), cardiac (C), epicardial precursors (EP), smooth muscle precursors (SMP), cardiac precursors (CP), and neuronal subtypes (NS).

Use the dropdown list to select a gene by which to color the visualizations. The genes we selected are highlighted in bold on the lineage tree.

In [6]:
interact.color_plot(path="img/EmbryoidBody.{}.png", name='Gene', default='POU5F1',
                    options=['CDX2', 'GATA5', 'GBX2', 'KLF5', 'LEF1', 'MYC',
                             'POU5F1', 'SOX10', 'SOX15', 'T', 'TNNT2'])

We've selected a subset of these genes so that you can explore how specific sets of marker genes change over time. We recommend starting by looking at POU5F1 (stem cell marker), GBX2 (neuroectoderm marker), and SOX10 (neural crest cell marker). In this trio of markers, we can see clear expression of POU5F1 in the cells from the earliest time point, followed by expression of GBX2. Notice how in t-SNE and UMAP these populations of cells are separated by a white space in the plot, but follow a clear trajectory in the PHATE embedding. Further note how in the PCA and MDS plots, GBX2 is expressed in a "smear" on the left sides of the plots. It is hard to identify the boundaries of these cell populations given those visualizations. With SOX10, this inability to identify a cell population with PCA and MDS is even clearer; it becomes more feasible with t-SNE, UMAP, and PHATE.

The next progression worth checking out is LEF1/MYC (early cardiac precursor), GATA5 (intermediate cardiac precursor), and TNNT2 (late cardiac precursor). This one is interesting because these genes are not as cleanly expressed in specific cells. Rather, we see broad expression (of the early markers especially) followed by increasing restriction of expression in the later markers. The important thing to remember here is that we already know a lot of this biology, so when you just look at the gene expression, it is easy to change your initial interpretation of the plot to match your knowledge of the system. But go back to the plot of the time points above and think, "if I had to draw arrows on each plot and identify the progression of cells just based on the sample colors, where would the arrows go?" In our opinion, one's intuition from the PHATE plot much more closely matches the underlying system than that from any other visualization tool.

Quantitative evaluation of visualizations


Quantification cartoon

Although interpretation of a low-dimensional embedding is inherently subjective and necessitates loss of information, it is possible to quantify the accuracy of a two dimensional embedding if we first decide on a set of criteria that we’d like to preserve. Here, we picked a few measures that capture both local and global structure preservation. We'll introduce each measure, and then show how each algorithm performs using simulated data.

Visualization quality metrics

In the original t-SNE paper, van der Maaten et al. presented a measure for quantitative evaluation of t-SNE based on a 1-Nearest Neighbor (1-NN) classifier. In this strategy, data with ground truth labels is embedded in two dimensions. Next, for each point in the embedding, the classifier guesses the true label based on the label of the nearest other point. The percentage of correct guesses is the accuracy of the classifier.
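
A minimal implementation of this 1-NN accuracy might look like the following; the two Gaussian blobs are stand-in data for demonstration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_accuracy(embedding, labels):
    # For each point, find its nearest *other* point
    # (the first neighbor returned is the point itself)
    _, idx = NearestNeighbors(n_neighbors=2).fit(embedding).kneighbors(embedding)
    return np.mean(labels[idx[:, 1]] == labels)

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(one_nn_accuracy(emb, labels))   # ~1.0 for well-separated blobs
```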

The 1-NN strategy is biased toward preserving local connections between data points, and so it is unsurprising that t-SNE performs well in this context. Note, however, that the 1-NN classifier has no way to quantify global relationships in the data. Also, there is relatively little penalty for merging unrelated clusters of separate labels, so long as points of each label remain grouped together within the merged structure. In practice, the human eye would likely assume such merged groups are of the same type.

An alternative measure for evaluating embeddings of data with ground truth labels is the Adjusted Rand Index (ARI). ARI is used to measure the quality of a clustering by calculating whether pairs of points are assigned to the same cluster in the predicted and ground truth labels respectively, adjusted for the probability of this occurring by random chance. Here, we use k-means clustering applied to the embedding to approximate what we do when considering visualizations without labels; we unconsciously partition the visualized data into groups, and ARI tells us how close this partitioning is to the ground truth. Once again, this metric considers only the local structure of the embedding; it does not evaluate the placement of clusters relative to one another, or the global structure of trajectories or other continuously varying data.
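
A sketch of this embedding ARI, again on stand-in blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def embedding_ari(embedding, true_labels, n_clusters):
    # Cluster the 2D embedding, then compare that partition to the ground truth
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return adjusted_rand_score(true_labels, pred)

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(embedding_ari(emb, labels, n_clusters=2))   # ~1.0 for well-separated blobs
```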

The final metric we will show here is called Denoised Embedding Manifold Preservation (DEMaP). We proposed DEMaP in the PHATE paper to address some of the shortcomings of metrics like those above. DEMaP requires the existence of ground truth data, rather than labels. We compute geodesic (or shortest path) distances on the ground truth data, which Tenenbaum et al. showed are equivalent to manifold distances in the noiseless case. We then embed a noisy version of the same data with each visualization technique and compute the Spearman correlation of ground truth geodesic distances with Euclidean distances in the embedding. This metric accounts for both local and global distances, and rewards methods which denoise the data accurately.
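
Here is a simplified sketch of the DEMaP computation; the kNN graph construction and the knn value are assumptions, and the exact implementation in the PHATE paper may differ:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.neighbors import kneighbors_graph

def demap(X_true, embedding, knn=5):
    # Geodesic distances on the noiseless ground truth, approximated by
    # shortest paths over a k-nearest-neighbor graph (assumes a connected graph)
    graph = kneighbors_graph(X_true, n_neighbors=knn, mode="distance")
    geodesic = shortest_path(graph, directed=False)
    # Spearman correlation with Euclidean distances in the embedding
    corr, _ = spearmanr(squareform(geodesic, checks=False), pdist(embedding))
    return corr

# Toy usage: a noiseless circle as ground truth, a noisy copy as the "embedding"
rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0, 2 * np.pi, 200))
X_true = np.column_stack([np.cos(theta), np.sin(theta)])
print(demap(X_true, X_true + rng.normal(scale=0.05, size=X_true.shape)))
```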

Simulated data for quantification

In this experiment, we use Splatter (Zappia et al.), a simulation algorithm which generates high-dimensional data from a parametric model. This model allows us to generate data with both noisy and ground truth values. For exact details of the simulations, see the PHATE paper. We show two types of simulation: "groups", which simulates data in clusters, and "paths" which simulates data in a tree structure. Below, you can click the Randomize! button to view a new random initialization of the simulation and the corresponding quantification results for each method. In each case, we sort the columns of the table by the DEMaP score.

In [9]:
data_types = ['paths', 'groups']
tab_widget = interact.TabWidget()
for data_type in data_types:
    with tab_widget.new_tab(data_type.capitalize()):
        interact.quantification_plot(image_path="img/quant/" + data_type + "/{}.png",
                                     data_path="data/quant/" + data_type + "/{}.csv", max_seed=100)

tab_widget.display()

Evaluating the results

Below you can see the results averaged across 100 random seeds. For the 1-NN classifier, t-SNE and UMAP perform best, which is unsurprising as these methods are specifically optimized to preserve local relationships. PHATE performs best in DEMaP by a small margin over UMAP in both groups and paths. UMAP performs best in ARI on paths, but in the groups simulation, where clustering is more relevant, PHATE outperforms the other methods by a large margin.

In [8]:
data_types = ['paths', 'groups']
outputs = []
for data_type in data_types:
    output = widgets.Output()
    path = "data/quant/{}/".format(data_type)
    # use a new name so we don't shadow the `data` module imported from blog_tools
    df = pd.concat([pd.read_csv(path + filename, index_col=0) for filename in os.listdir(path)])
    df = df.groupby('method').mean().sort_values('DEMaP', ascending=False).T.round(3)
    df.columns.name = data_type.capitalize()
    with output:
        display(df)
    outputs.append(output)

display(widgets.HBox(outputs))

Further reading

Our group is far from the first to write a comparison of visualization algorithms. Below is a list of articles (in no particular order) that you might find interesting if you would like to read more.