%pylab inline
import pandas as pd
import plotnine as p
import anndata
Populating the interactive namespace from numpy and matplotlib
/home/vale/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. from ._conv import register_converters as _register_converters
adata = anndata.read('tcells.h5ad')
First we find a low-dimensional representation of these cells. Here I will be using truncated SVD to find an ~25 dimensional representation. I'm mostly doing this out of convenience, since we are not really interested in analysing these cells per se. The best way to learn proper latent representations for the cells would be to use scVI (or similar methods), but I don't have that installed on the computer I'm running this. A slightly better way would be to regress out count depth and perform actual PCA rather than simple SVD. But the point here is just the timing of what you would do downstream of finding a reprsentation.
from sklearn.decomposition import TruncatedSVD
tsvd = TruncatedSVD(n_components=25)
Y = tsvd.fit_transform(np.log1p(adata.X))
for i, y in enumerate(Y.T):
adata.obs[f'SVD_{i}'] = y
p.ggplot(p.aes('SVD_0', 'SVD_1', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (8768079457505)>
p.ggplot(p.aes('SVD_0', 'SVD_2', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (-9223363268769100158)>
p.ggplot(p.aes('SVD_0', 'SVD_3', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (-9223363268773834913)>
p.ggplot(p.aes('SVD_3', 'SVD_4', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (8768079190497)>
From this exploration it was clear the first few factors corresponded to known technical factors. I removed these in an attempt to "align" the data, but it didn't work. In the end though. For the sake of the comparison we are aiming for here, it really doesn't matter.
Yd = Y[:, 3:].copy(order='C').astype(double)
Yd.shape
(75029, 22)
First, let's run both FIt-SNE and UMAP on the full dataset to get a qualitative image of how the 22-dimensional representation space gets spread out in the 2D embedding.
import fitsne
import umap
Another quick note: By default FIt-SNE runs 1000 iterations, while UMAP runs 500 iterations for "small" datasets and 200 iterations for "large" datasets. To split the difference, I run 500 iterations for both methods.
YY = fitsne.FItSNE(Yd, max_iter=500)
for i, y in enumerate(YY.T):
adata.obs[f'FItSNE_{i}'] = y
p.ggplot(p.aes('FItSNE_0', 'FItSNE_1', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (8768085792278)>
ump = umap.UMAP(n_epochs=500)
UU = ump.fit_transform(Yd)
for i, y in enumerate(UU.T):
adata.obs[f'UMAP_{i}'] = y
p.ggplot(p.aes('UMAP_0', 'UMAP_1', color='Tech'), adata.obs) \
+ p.geom_point(size=0.1, alpha=0.33)
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (8768079454002)>
Something I prefer with tSNE plots compared to UMAP plots is that they use the 2D space better - filling it. This let you see more points at once without them overlapping, it makes it easier to e.g. plot gene expression on all of them.
In tSNE relations between clusters are not meaningful though. The fact that blue blobs appear between the red blobs is just a quirk of how it happened to use the space. The segmentation into clusters is reflected in the faint empty "edges" between them. In UMAP distances between clusters are supposed to be interpretable. This will lead to very different clusters cells being far apart, creating a lot of empty space in order to represent the relative within-cluster heterogeneity.
Now we subsample dthe 75,000 cells at different levels on a logarithimic scale. Then we time FIt-SNE and UMAP for each sample of cells.
sample_sizes = np.around(np.logspace(np.log10(500), np.log10(75000), num=10), -3).astype(int)
sample_sizes[0] = 500
from time import time
result_list = []
for s in sample_sizes:
sample_idx = np.random.choice(np.arange(Yd.shape[0]), s, replace=False)
sample_Y = Yd[sample_idx]
# FIt-SNE
t0 = time()
fitsne.FItSNE(sample_Y, max_iter=500)
runtime = time() - t0
result_list.append({'method': 'FIt-SNE', 'sample_size': s, 'run_time': runtime})
# UMAP
t0 = time()
ump = umap.UMAP(n_epochs=500)
ump.fit_transform(sample_Y)
runtime = time() - t0
result_list.append({'method': 'UMAP', 'sample_size': s, 'run_time': runtime})
results = pd.DataFrame(result_list)
results
method | run_time | sample_size | |
---|---|---|---|
0 | FIt-SNE | 3.397268 | 500 |
1 | UMAP | 2.094727 | 500 |
2 | FIt-SNE | 4.509084 | 1000 |
3 | UMAP | 3.042581 | 1000 |
4 | FIt-SNE | 5.422171 | 2000 |
5 | UMAP | 6.038549 | 2000 |
6 | FIt-SNE | 5.711554 | 3000 |
7 | UMAP | 9.060977 | 3000 |
8 | FIt-SNE | 6.856716 | 5000 |
9 | UMAP | 15.493820 | 5000 |
10 | FIt-SNE | 8.503103 | 8000 |
11 | UMAP | 24.432012 | 8000 |
12 | FIt-SNE | 12.055970 | 14000 |
13 | UMAP | 43.078622 | 14000 |
14 | FIt-SNE | 23.786085 | 25000 |
15 | UMAP | 79.055026 | 25000 |
16 | FIt-SNE | 37.870549 | 43000 |
17 | UMAP | 144.280644 | 43000 |
18 | FIt-SNE | 68.767722 | 75000 |
19 | UMAP | 251.442395 | 75000 |
p.ggplot(p.aes('sample_size', 'run_time', color='method'), results) \
+ p.geom_point(size=3) \
+ p.geom_line() \
+ p.scale_x_log10() \
+ p.scale_y_log10() \
+ p.labs(x='Number of cells', y='Runtime (seconds)', title='Time of embedding 22 PCs for varying sample sizes') \
+ p.theme_classic()
/home/vale/anaconda3/lib/python3.6/site-packages/plotnine/utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. ndistinct = ids.apply(len_unique, axis=0).as_matrix() /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. object.__getattribute__(self, name) /home/vale/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version. return object.__setattr__(self, name, value)
<ggplot: (-9223363268771649066)>
The FIt-SNE implementation is generally faster than UMAP when you have more than 3,000 cells. In the realm of 10,000's of cells FIt-SNE scales at the same rate as UMAP. However, note that this is a log-log scale. Even if FI-tSNE starts scaling at the rate of UMAP, it is still consistently about 4 times faster. In other words, a dataset that takes an hour for UMAP will take 15 minutes for FIt-SNE.