Since our paper on SpatialDE was published today in Nature Methods, I wanted to demonstrate how to use our method on a dataset which came out while the paper was under review.

I spent a few hours this afternoon (March 19), doing this analysis. The data is from Asp et al published in Scientific Reports in October last year. The authors use the Spatial Transcriptomics method to investigate spatial gene expression in cardiac biopsies.

In [1]:
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

According to Table 1 of Asp et al , individual 3 right atrial appandage is covering 543 "spots", being one of the larger datasets. This data is available on the ST website, I downloaded both tables for Individual3_RAA.

In [22]:
long_table = pd.read_csv('Individual3_RAA_R1.txt', sep='\t')
In [32]:
long_table.head()
Out[32]:
# gene_name barcode_id x y reads_count
0 ENSG00000103512 CCTCGAACTCGAACGTTG 10 5 1
1 ENSG00000172209 CCTCGAACTCGAACGTTG 10 5 1
2 ENSG00000138663 CCTCGAACTCGAACGTTG 10 5 1
3 ENSG00000132541 CCTCGAACTCGAACGTTG 10 5 1
4 ENSG00000114126 CCTCGAACTCGAACGTTG 10 5 1

This table is in "long" format, where each row is a combination of gene, x, y, and gene count. This is generally a pretty handy format, but for our analysis, it will be simpler to work on a wide expression matrix.

In [34]:
counts = long_table \
    .pivot(index='barcode_id', columns='# gene_name', values='reads_count') \
    .fillna(0) \
    .astype(int)
In [35]:
counts.iloc[:10, :5]
Out[35]:
# gene_name ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
barcode_id
ACAAAGAGATATAGCCCT 0 0 0 0 0
ACAACTATGGGTTGGCGG 0 0 0 0 0
ACAAGCCGCGCTGTTACG 0 0 0 0 0
ACACAGATCCTGTTCTGA 0 0 1 0 0
ACACTACGTCCTGTCGAA 0 0 0 0 0
ACATAGACGCCGTCTCGG 0 0 0 0 0
ACATCACCTGCGCGCTCT 0 0 0 0 0
ACATGTATGGCCGCGCCC 0 0 0 0 0
ACATTAACCTGATAACAT 0 0 0 0 0
ACATTTAAGGCGCATGAT 0 0 0 0 0
In [36]:
counts.shape
Out[36]:
(534, 14980)

To keep track of spatial location of each sample, we create a sample information table.

In [39]:
sample_info = long_table.groupby('barcode_id')['x', 'y'].first()
In [46]:
# Just to make sure ecerything is sorted the same
sample_info = sample_info.sort_index()
counts = counts.sort_index()
In [183]:
figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c='k', s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');
In [53]:
sample_info['total_count'] = counts.sum(1)
In [63]:
# Filter practically unobserved genes, and remove practically empty "spots"
counts = counts.T[counts.sum(0) >= 3].T
sample_info = sample_info.query('total_count > 5').copy()
counts = counts.loc[sample_info.index]
In [89]:
counts.shape
Out[89]:
(530, 12150)

After filtering, there are 530 spatial "spots", where 12,150 genes have been measured.

In [184]:
figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c=np.log10(sample_info.total_count), s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');

Before running SpatialDE we need to do a couple of things: first variance stabilize the counts, then control for the variation in count depth for the "spots".

In [72]:
import NaiveDE

# Apply Anscombe's transform for NB data
dfm = NaiveDE.stabilize(counts.T).T

# Extract residuals from regression
resids = NaiveDE.regress_out(sample_info, dfm.T, 'np.log(total_count)').T

# Put spatial coordinates in their own table
X = sample_info[['x', 'y']]

Now we run SpatialDE, this takes about 3 minutes on my iMac

In [73]:
%%time
import SpatialDE

results = SpatialDE.run(X, resids)
INFO:root:Performing DE test
INFO:root:Pre-calculating USU^T = K's ...
INFO:root:Done: 0.34s
INFO:root:Fitting gene models
INFO:root:Model 1 of 10
INFO:root:Model 2 of 10                               
INFO:root:Model 3 of 10                               
INFO:root:Model 4 of 10                                
INFO:root:Model 5 of 10                                
INFO:root:Model 6 of 10                               
INFO:root:Model 7 of 10                               
INFO:root:Model 8 of 10                               
INFO:root:Model 9 of 10                               
INFO:root:Model 10 of 10                              
                                                      
CPU times: user 2min 40s, sys: 3.15 s, total: 2min 43s
Wall time: 2min 38s
In [76]:
results.head(3).T
Out[76]:
0 1 2
FSV 2.05897e-09 2.05897e-09 2.05897e-09
M 4 4 4
g ENSG00000000419 ENSG00000000938 ENSG00000001036
l 0.5 0.5 0.5
max_delta 4.85165e+08 4.85165e+08 4.85165e+08
max_ll 179.568 881.869 327.966
max_mu_hat 0.129841 0.350817 0.186056
max_s2_t_hat 9.60333e-11 2.58001e-10 1.06357e-10
model SE SE SE
n 530 530 530
s2_FSV 0.759402 0.611944 0.0295725
s2_logdelta 1.29702e+17 1.04517e+17 5.05083e+15
time 0.000999928 0.000757933 0.000724792
BIC -334.044 -1738.65 -630.841
max_ll_null 179.567 881.868 327.966
LLR 0.00047229 0.000472289 0.000472279
pval 0.982662 0.982662 0.982662
qval 0.982662 0.982662 0.982662
In [88]:
results.shape
Out[88]:
(12150, 18)
In [77]:
import SpatialDE.plot
In [80]:
# For numerical reasons, sometimes p-values are set to 0 of they are too small.
# Bring these back to reasonable numbers.
results['pval'] = results['pval'].clip_lower(results.query('pval > 0')['pval'].min() / 2)
results['qval'] = results['qval'].clip_lower(results.query('qval > 0')['qval'].min() / 2)
In [81]:
SpatialDE.plot.FSV_sig(results)
In [85]:
sig_res = results.query('qval < 0.05')
In [86]:
sig_res.shape
Out[86]:
(73, 18)

This means 73 genes are significantly spatially variable. We would like to investigate these, so we need to translate the ENSEMBL IDs in the table to gene names that are easier for people to parse. For this, I went to http://uswest.ensembl.org/biomart and downloaded a table of gene ids and gene names for the human gene annotation. I don't know if this is the annotation Asp et al used, but in most cases the IDs haven't changed in a long time.

In [92]:
gene_info = pd.read_csv('gene_ids.tsv', index_col=0, sep='\t')
gene_info.head()
Out[92]:
Gene name Gene description
Gene stable ID
ENSG00000283891 MIR628 microRNA 628 [Source:HGNC Symbol;Acc:HGNC:32884]
ENSG00000251931 RNU6-871P RNA, U6 small nuclear 871, pseudogene [Source:...
ENSG00000207766 MIR626 microRNA 626 [Source:HGNC Symbol;Acc:HGNC:32882]
ENSG00000275323 AC012314.7 40S ribosomal protein S9 [Source:UniProtKB/Sw...
ENSG00000276678 GHRLOS Ghrelin opposite strand RNA conserved region [...
In [95]:
results['gene_name'] = results.g.map(gene_info['Gene name'])
In [96]:
sig_res = results.query('qval < 0.05')

What is the general distribution of length scales of significant genes?

In [99]:
sig_res.l.value_counts()
Out[99]:
0.500000     22
4.178220     21
7.103897     17
2.457457      5
0.850110      4
1.445375      3
12.078193     1
Name: l, dtype: int64

For the ST technology used in Asp et al, the length unit for spatial coordinates is 200 µm, so in real-world coordinates, a lengthscale of 0.5 corresponds to 100 µm, while 4.17 corresponds to 834 µm. These are the roughyl scales at which expression levels change in a meaningful way. Now let's see which genes are represented at these scales. We can probably ignore the results for the smallest lengthscale, since it is so much smaller than smallest observed distance, it is likely to indicate a couple of neighboring spots expressing the gene.

In [129]:
grp = sig_res.groupby('l')
for l, group in grp:
    top_string = ', '.join(group.sort_values('pval').head(10)['gene_name'].values)
    if group.shape[0] > 10:
        top_string += ', ...'

    print(f'l = {l}:')
    print(top_string)
    print()
l = 0.5:
PABPC3, CDH23, AMY2A, LARP1B, ZNF117, LRRC4B, PRR15, ZDHHC11, PTPRD, LYSMD1, ...

l = 0.8501104046527316:
ALMS1, SDK1, NPM3, GFRA3

l = 1.445375400197662:
HBA2, HBB, TPM1

l = 2.4574573326742764:
NPPA, MT-CO3, CST3, CFD, DES

l = 4.178220094993103:
DCN, FBLN1, MT-ATP6, MT-ND4, MT-CO1, MT-CO2, MB, PTGDS, MT-CYB, TNNT2, ...

l = 7.10389675136552:
MT-ND2, MT-ND6, MTRNR2L1, MYH6, GAPDH, MT-ND5, ITLN1, ACTC1, CYCS, C1S, ...

l = 12.078193083829134:
MYL7

At this point it is usually fun to google all these genes, and plot what they look like spatially. I leave that as an exercise to the reader, but plot the top gene at each lengthscale so you can get a feeling for how the different lengthscales look.

In [179]:
figsize(10, 6)
top_list = ['PABPC3', 'ALMS1', 'HBA2', 'NPPA', 'DCN', 'MT-ND2', 'MYL7']
for i, g in enumerate(top_list):
    plt.subplot(2, 4, i + 1)
    g_id = results.query('gene_name == @g').g.iloc[0]
    l = results.query('gene_name == @g').l.iloc[0]
    plt.scatter(sample_info.x, sample_info.y, c=resids[g_id], s=25)
    plt.axis('equal');
    plt.gca().invert_yaxis();
    plt.title(f'l = {l:.2}: {g}')
    
plt.tight_layout()

Now let's investigate what histological patterns in the heart tissue is defined by these spatially variable genes.

I'm excluding the genes with very small lengthscales, because they seem independent from eachother. I'm picking 8 patterns, and a lengthscale of 2.0 which should be compatible with the l=1.4 genes at the same time as using the spatial prior for genes with larger l.

In [238]:
%%time
histology_results, patterns = SpatialDE.spatial_patterns(X, res, sig_res.query('l > 0.8'), 8, 2.0, verbosity=1)
iter 0, ELBO: -2.60e+10
iter 1, ELBO: -1.31e+10, delta_ELBO: 1.29e+10
iter 2, ELBO: -1.31e+10, delta_ELBO: 2.89e+03
iter 3, ELBO: -1.31e+10, delta_ELBO: 1.60e+03
iter 4, ELBO: -1.31e+10, delta_ELBO: 5.32e+02
iter 5, ELBO: -1.31e+10, delta_ELBO: 6.26e+01
iter 6, ELBO: -1.31e+10, delta_ELBO: 8.26e+01
iter 7, ELBO: -1.31e+10, delta_ELBO: 1.22e+00
iter 8, ELBO: -1.31e+10, delta_ELBO: 1.69e-02
iter 9, ELBO: -1.31e+10, delta_ELBO: 2.44e-04
Converged on iter 9
CPU times: user 26.1 s, sys: 1.64 s, total: 27.7 s
Wall time: 14.2 s
In [239]:
histology_results.pattern.value_counts()
Out[239]:
7    11
1    10
6     9
3     8
0     6
2     3
5     2
4     2
Name: pattern, dtype: int64
In [240]:
# Again put in interpretable gene names
histology_results['gene_name'] = histology_results.g.map(gene_info['Gene name'])
In [241]:
for i in histology_results.sort_values('pattern').pattern.unique():
    
    print('Pattern {}'.format(i))
    print(', '.join(histology_results.query('pattern == @i').sort_values('membership')['gene_name'].tolist()))
    print()
Pattern 0
SLC25A4, ATP5A1, TPM1, TTN, ACTC1, MYH6

Pattern 1
MTRNR2L1, MT-ND1, MT-CO3, MT-CO2, MT-CYB, MT-CO1, MT-ND3, MT-ND4, MT-ATP6, MT-ND2

Pattern 2
ALMS1, KRT7, ITLN1

Pattern 3
MTRNR2L12, WISP2, MT-ND6, SDK1, RPL32, MGP, MT-ND5, MT-ND4L

Pattern 4
LDB3, NPM3

Pattern 5
HBA2, HBB

Pattern 6
GFRA3, C1R, C1S, CST3, CFD, DCN, FBLN1, FTL, C3

Pattern 7
CYCS, PTGDS, DES, TNNT2, NPPA, S100A1, MB, CHCHD10, GAPDH, ATP2A2, MYL7

These groups of genes can then be directly related to underlying spatial patterns.

In [245]:
figsize(13, 5.5)
for i, Ci in enumerate(histology_results.sort_values('pattern').pattern.unique()):
    C = patterns[Ci]
    plt.subplot(2, 4, i + 1)
    plt.scatter(sample_info.x, sample_info.y, c=C, s=25)
    plt.axis('equal');
    plt.gca().invert_yaxis();
    plt.title(f'Pattern {i}')
    
plt.tight_layout()