In [1]:

%pylab inline
import pandas as pd
import plotnine as p
import anndata

Populating the interactive namespace from numpy and matplotlib

/home/vale/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

The goal of this notebook is to compare the time needed to embed typical scRNA-seq datasets on the larger scale.

Here I am using a combination of the joint dataset of GSE114724 and GSE114725, which consist of immune cells from breast carcinomas. In total the joint dataset has 75,029 cells

In [2]:

adata = anndata.read('tcells.h5ad')

First we find a low-dimensional representation of these cells. Here I will be using truncated SVD to find an ~25 dimensional representation. I'm mostly doing this out of convenience, since we are not really interested in analysing these cells per se. The best way to learn proper latent representations for the cells would be to use scVI (or similar methods), but I don't have that installed on the computer I'm running this. A slightly better way would be to regress out count depth and perform actual PCA rather than simple SVD. But the point here is just the timing of what you would do downstream of finding a reprsentation.

In [3]:

from sklearn.decomposition import TruncatedSVD

In [4]:

tsvd = TruncatedSVD(n_components=25)

In [5]:

Y = tsvd.fit_transform(np.log1p(adata.X))

In [41]:

for i, y in enumerate(Y.T):
    adata.obs[f'SVD_{i}'] = y

In [42]:

p.ggplot(p.aes('SVD_0', 'SVD_1', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[42]:

<ggplot: (8768079457505)>

In [43]:

p.ggplot(p.aes('SVD_0', 'SVD_2', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[43]:

<ggplot: (-9223363268769100158)>

In [44]:

p.ggplot(p.aes('SVD_0', 'SVD_3', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[44]:

<ggplot: (-9223363268773834913)>

In [45]:

p.ggplot(p.aes('SVD_3', 'SVD_4', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[45]:

<ggplot: (8768079190497)>

From this exploration it was clear the first few factors corresponded to known technical factors. I removed these in an attempt to "align" the data, but it didn't work. In the end though. For the sake of the comparison we are aiming for here, it really doesn't matter.

In [46]:

Yd = Y[:, 3:].copy(order='C').astype(double)

In [55]:

Yd.shape

Out[55]:

(75029, 22)

First, let's run both FIt-SNE and UMAP on the full dataset to get a qualitative image of how the 22-dimensional representation space gets spread out in the 2D embedding.

In [47]:

import fitsne
import umap

Another quick note: By default FIt-SNE runs 1000 iterations, while UMAP runs 500 iterations for "small" datasets and 200 iterations for "large" datasets. To split the difference, I run 500 iterations for both methods.

In [48]:

YY = fitsne.FItSNE(Yd, max_iter=500)

In [49]:

for i, y in enumerate(YY.T):
    adata.obs[f'FItSNE_{i}'] = y

In [50]:

p.ggplot(p.aes('FItSNE_0', 'FItSNE_1', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[50]:

<ggplot: (8768085792278)>

In [51]:

ump = umap.UMAP(n_epochs=500)

In [52]:

UU = ump.fit_transform(Yd)

In [53]:

for i, y in enumerate(UU.T):
    adata.obs[f'UMAP_{i}'] = y

In [54]:

p.ggplot(p.aes('UMAP_0', 'UMAP_1', color='Tech'), adata.obs) \
    + p.geom_point(size=0.1, alpha=0.33)

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[54]:

<ggplot: (8768079454002)>

Something I prefer with tSNE plots compared to UMAP plots is that they use the 2D space better - filling it. This let you see more points at once without them overlapping, it makes it easier to e.g. plot gene expression on all of them.

In tSNE relations between clusters are not meaningful though. The fact that blue blobs appear between the red blobs is just a quirk of how it happened to use the space. The segmentation into clusters is reflected in the faint empty "edges" between them. In UMAP distances between clusters are supposed to be interpretable. This will lead to very different clusters cells being far apart, creating a lot of empty space in order to represent the relative within-cluster heterogeneity.

Now we subsample dthe 75,000 cells at different levels on a logarithimic scale. Then we time FIt-SNE and UMAP for each sample of cells.

In [70]:

sample_sizes = np.around(np.logspace(np.log10(500), np.log10(75000), num=10), -3).astype(int)
sample_sizes[0] = 500

In [72]:

from time import time

In [93]:

result_list = []
for s in sample_sizes:
    sample_idx = np.random.choice(np.arange(Yd.shape[0]), s, replace=False)
    sample_Y = Yd[sample_idx]
    
    # FIt-SNE
    t0 = time()
    fitsne.FItSNE(sample_Y, max_iter=500)
    runtime = time() - t0
    
    result_list.append({'method': 'FIt-SNE', 'sample_size': s, 'run_time': runtime})
    
    # UMAP
    t0 = time()
    ump = umap.UMAP(n_epochs=500)
    ump.fit_transform(sample_Y)
    runtime = time() - t0
    
    result_list.append({'method': 'UMAP', 'sample_size': s, 'run_time': runtime})

In [73]:

In [103]:

results = pd.DataFrame(result_list)
results

Out[103]:

	method	run_time	sample_size
0	FIt-SNE	3.397268	500
1	UMAP	2.094727	500
2	FIt-SNE	4.509084	1000
3	UMAP	3.042581	1000
4	FIt-SNE	5.422171	2000
5	UMAP	6.038549	2000
6	FIt-SNE	5.711554	3000
7	UMAP	9.060977	3000
8	FIt-SNE	6.856716	5000
9	UMAP	15.493820	5000
10	FIt-SNE	8.503103	8000
11	UMAP	24.432012	8000
12	FIt-SNE	12.055970	14000
13	UMAP	43.078622	14000
14	FIt-SNE	23.786085	25000
15	UMAP	79.055026	25000
16	FIt-SNE	37.870549	43000
17	UMAP	144.280644	43000
18	FIt-SNE	68.767722	75000
19	UMAP	251.442395	75000

In [113]:

p.ggplot(p.aes('sample_size', 'run_time', color='method'), results) \
    + p.geom_point(size=3) \
    + p.geom_line() \
    + p.scale_x_log10() \
    + p.scale_y_log10() \
    + p.labs(x='Number of  cells', y='Runtime (seconds)', title='Time of embedding 22 PCs for varying sample sizes') \
    + p.theme_classic()

/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
/home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)

Out[113]:

<ggplot: (-9223363268771649066)>

The FIt-SNE implementation is generally faster than UMAP when you have more than 3,000 cells. In the realm of 10,000's of cells FIt-SNE scales at the same rate as UMAP. However, note that this is a log-log scale. Even if FI-tSNE starts scaling at the rate of UMAP, it is still consistently about 4 times faster. In other words, a dataset that takes an hour for UMAP will take 15 minutes for FIt-SNE.

In [ ]: