Since our paper on SpatialDE was published today in Nature Methods, I wanted to demonstrate how to use our method on a dataset which came out while the paper was under review.

I spent a few hours this afternoon (March 19), doing this analysis. The data is from Asp et al published in Scientific Reports in October last year. The authors use the Spatial Transcriptomics method to investigate spatial gene expression in cardiac biopsies.

In [1]:

%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

According to Table 1 of Asp et al , individual 3 right atrial appandage is covering 543 "spots", being one of the larger datasets. This data is available on the ST website, I downloaded both tables for Individual3_RAA.

In [22]:

long_table = pd.read_csv('Individual3_RAA_R1.txt', sep='\t')

In [32]:

long_table.head()

Out[32]:

	# gene_name	barcode_id	x	y	reads_count
0	ENSG00000103512	CCTCGAACTCGAACGTTG	10	5	1
1	ENSG00000172209	CCTCGAACTCGAACGTTG	10	5	1
2	ENSG00000138663	CCTCGAACTCGAACGTTG	10	5	1
3	ENSG00000132541	CCTCGAACTCGAACGTTG	10	5	1
4	ENSG00000114126	CCTCGAACTCGAACGTTG	10	5	1

This table is in "long" format, where each row is a combination of gene, x, y, and gene count. This is generally a pretty handy format, but for our analysis, it will be simpler to work on a wide expression matrix.

In [34]:

counts = long_table \
    .pivot(index='barcode_id', columns='# gene_name', values='reads_count') \
    .fillna(0) \
    .astype(int)

In [35]:

counts.iloc[:10, :5]

Out[35]:

# gene_name	ENSG00000000003	ENSG00000000005	ENSG00000000419	ENSG00000000457	ENSG00000000460
barcode_id
ACAAAGAGATATAGCCCT	0	0	0	0	0
ACAACTATGGGTTGGCGG	0	0	0	0	0
ACAAGCCGCGCTGTTACG	0	0	0	0	0
ACACAGATCCTGTTCTGA	0	0	1	0	0
ACACTACGTCCTGTCGAA	0	0	0	0	0
ACATAGACGCCGTCTCGG	0	0	0	0	0
ACATCACCTGCGCGCTCT	0	0	0	0	0
ACATGTATGGCCGCGCCC	0	0	0	0	0
ACATTAACCTGATAACAT	0	0	0	0	0
ACATTTAAGGCGCATGAT	0	0	0	0	0

In [36]:

counts.shape

Out[36]:

(534, 14980)

To keep track of spatial location of each sample, we create a sample information table.

In [39]:

sample_info = long_table.groupby('barcode_id')['x', 'y'].first()

In [46]:

# Just to make sure ecerything is sorted the same
sample_info = sample_info.sort_index()
counts = counts.sort_index()

In [183]:

figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c='k', s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');

In [53]:

sample_info['total_count'] = counts.sum(1)

In [63]:

# Filter practically unobserved genes, and remove practically empty "spots"
counts = counts.T[counts.sum(0) >= 3].T
sample_info = sample_info.query('total_count > 5').copy()
counts = counts.loc[sample_info.index]

In [89]:

counts.shape

Out[89]:

(530, 12150)

After filtering, there are 530 spatial "spots", where 12,150 genes have been measured.

In [184]:

figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c=np.log10(sample_info.total_count), s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');

Before running SpatialDE we need to do a couple of things: first variance stabilize the counts, then control for the variation in count depth for the "spots".

In [72]:

import NaiveDE

# Apply Anscombe's transform for NB data
dfm = NaiveDE.stabilize(counts.T).T

# Extract residuals from regression
resids = NaiveDE.regress_out(sample_info, dfm.T, 'np.log(total_count)').T

# Put spatial coordinates in their own table
X = sample_info[['x', 'y']]

Now we run SpatialDE, this takes about 3 minutes on my iMac

In [73]:

%%time
import SpatialDE

results = SpatialDE.run(X, resids)

INFO:root:Performing DE test
INFO:root:Pre-calculating USU^T = K's ...
INFO:root:Done: 0.34s
INFO:root:Fitting gene models
INFO:root:Model 1 of 10
INFO:root:Model 2 of 10                               
INFO:root:Model 3 of 10                               
INFO:root:Model 4 of 10                                
INFO:root:Model 5 of 10                                
INFO:root:Model 6 of 10                               
INFO:root:Model 7 of 10                               
INFO:root:Model 8 of 10                               
INFO:root:Model 9 of 10                               
INFO:root:Model 10 of 10

CPU times: user 2min 40s, sys: 3.15 s, total: 2min 43s
Wall time: 2min 38s

In [76]:

results.head(3).T

Out[76]:

	0	1	2
FSV	2.05897e-09	2.05897e-09	2.05897e-09
M	4	4	4
g	ENSG00000000419	ENSG00000000938	ENSG00000001036
l	0.5	0.5	0.5
max_delta	4.85165e+08	4.85165e+08	4.85165e+08
max_ll	179.568	881.869	327.966
max_mu_hat	0.129841	0.350817	0.186056
max_s2_t_hat	9.60333e-11	2.58001e-10	1.06357e-10
model	SE	SE	SE
n	530	530	530
s2_FSV	0.759402	0.611944	0.0295725
s2_logdelta	1.29702e+17	1.04517e+17	5.05083e+15
time	0.000999928	0.000757933	0.000724792
BIC	-334.044	-1738.65	-630.841
max_ll_null	179.567	881.868	327.966
LLR	0.00047229	0.000472289	0.000472279
pval	0.982662	0.982662	0.982662
qval	0.982662	0.982662	0.982662

In [88]:

results.shape

Out[88]:

(12150, 18)

In [77]:

import SpatialDE.plot

In [80]:

# For numerical reasons, sometimes p-values are set to 0 of they are too small.
# Bring these back to reasonable numbers.
results['pval'] = results['pval'].clip_lower(results.query('pval > 0')['pval'].min() / 2)
results['qval'] = results['qval'].clip_lower(results.query('qval > 0')['qval'].min() / 2)

In [81]:

SpatialDE.plot.FSV_sig(results)

In [85]:

sig_res = results.query('qval < 0.05')

In [86]:

sig_res.shape

Out[86]:

(73, 18)

This means 73 genes are significantly spatially variable. We would like to investigate these, so we need to translate the ENSEMBL IDs in the table to gene names that are easier for people to parse. For this, I went to http://uswest.ensembl.org/biomart and downloaded a table of gene ids and gene names for the human gene annotation. I don't know if this is the annotation Asp et al used, but in most cases the IDs haven't changed in a long time.

In [92]:

gene_info = pd.read_csv('gene_ids.tsv', index_col=0, sep='\t')
gene_info.head()

Out[92]:

	Gene name	Gene description
Gene stable ID
ENSG00000283891	MIR628	microRNA 628 [Source:HGNC Symbol;Acc:HGNC:32884]
ENSG00000251931	RNU6-871P	RNA, U6 small nuclear 871, pseudogene [Source:...
ENSG00000207766	MIR626	microRNA 626 [Source:HGNC Symbol;Acc:HGNC:32882]
ENSG00000275323	AC012314.7	40S ribosomal protein S9 [Source:UniProtKB/Sw...
ENSG00000276678	GHRLOS	Ghrelin opposite strand RNA conserved region [...

In [95]:

results['gene_name'] = results.g.map(gene_info['Gene name'])

In [96]:

sig_res = results.query('qval < 0.05')

What is the general distribution of length scales of significant genes?

In [99]:

sig_res.l.value_counts()

Out[99]:

0.500000     22
4.178220     21
7.103897     17
2.457457      5
0.850110      4
1.445375      3
12.078193     1
Name: l, dtype: int64

For the ST technology used in Asp et al, the length unit for spatial coordinates is 200 µm, so in real-world coordinates, a lengthscale of 0.5 corresponds to 100 µm, while 4.17 corresponds to 834 µm. These are the roughyl scales at which expression levels change in a meaningful way. Now let's see which genes are represented at these scales. We can probably ignore the results for the smallest lengthscale, since it is so much smaller than smallest observed distance, it is likely to indicate a couple of neighboring spots expressing the gene.

In [129]:

grp = sig_res.groupby('l')
for l, group in grp:
    top_string = ', '.join(group.sort_values('pval').head(10)['gene_name'].values)
    if group.shape[0] > 10:
        top_string += ', ...'

    print(f'l = {l}:')
    print(top_string)
    print()

l = 0.5:
PABPC3, CDH23, AMY2A, LARP1B, ZNF117, LRRC4B, PRR15, ZDHHC11, PTPRD, LYSMD1, ...

l = 0.8501104046527316:
ALMS1, SDK1, NPM3, GFRA3

l = 1.445375400197662:
HBA2, HBB, TPM1

l = 2.4574573326742764:
NPPA, MT-CO3, CST3, CFD, DES

l = 4.178220094993103:
DCN, FBLN1, MT-ATP6, MT-ND4, MT-CO1, MT-CO2, MB, PTGDS, MT-CYB, TNNT2, ...

l = 7.10389675136552:
MT-ND2, MT-ND6, MTRNR2L1, MYH6, GAPDH, MT-ND5, ITLN1, ACTC1, CYCS, C1S, ...

l = 12.078193083829134:
MYL7

At this point it is usually fun to google all these genes, and plot what they look like spatially. I leave that as an exercise to the reader, but plot the top gene at each lengthscale so you can get a feeling for how the different lengthscales look.

In [179]:

figsize(10, 6)
top_list = ['PABPC3', 'ALMS1', 'HBA2', 'NPPA', 'DCN', 'MT-ND2', 'MYL7']
for i, g in enumerate(top_list):
    plt.subplot(2, 4, i + 1)
    g_id = results.query('gene_name == @g').g.iloc[0]
    l = results.query('gene_name == @g').l.iloc[0]
    plt.scatter(sample_info.x, sample_info.y, c=resids[g_id], s=25)
    plt.axis('equal');
    plt.gca().invert_yaxis();
    plt.title(f'l = {l:.2}: {g}')
    
plt.tight_layout()

Now let's investigate what histological patterns in the heart tissue is defined by these spatially variable genes.

I'm excluding the genes with very small lengthscales, because they seem independent from eachother. I'm picking 8 patterns, and a lengthscale of 2.0 which should be compatible with the l=1.4 genes at the same time as using the spatial prior for genes with larger l.

In [238]:

%%time
histology_results, patterns = SpatialDE.spatial_patterns(X, res, sig_res.query('l > 0.8'), 8, 2.0, verbosity=1)

iter 0, ELBO: -2.60e+10
iter 1, ELBO: -1.31e+10, delta_ELBO: 1.29e+10
iter 2, ELBO: -1.31e+10, delta_ELBO: 2.89e+03
iter 3, ELBO: -1.31e+10, delta_ELBO: 1.60e+03
iter 4, ELBO: -1.31e+10, delta_ELBO: 5.32e+02
iter 5, ELBO: -1.31e+10, delta_ELBO: 6.26e+01
iter 6, ELBO: -1.31e+10, delta_ELBO: 8.26e+01
iter 7, ELBO: -1.31e+10, delta_ELBO: 1.22e+00
iter 8, ELBO: -1.31e+10, delta_ELBO: 1.69e-02
iter 9, ELBO: -1.31e+10, delta_ELBO: 2.44e-04
Converged on iter 9
CPU times: user 26.1 s, sys: 1.64 s, total: 27.7 s
Wall time: 14.2 s

In [239]:

histology_results.pattern.value_counts()

Out[239]:

7    11
1    10
6     9
3     8
0     6
2     3
5     2
4     2
Name: pattern, dtype: int64

In [240]:

# Again put in interpretable gene names
histology_results['gene_name'] = histology_results.g.map(gene_info['Gene name'])

In [241]:

for i in histology_results.sort_values('pattern').pattern.unique():
    
    print('Pattern {}'.format(i))
    print(', '.join(histology_results.query('pattern == @i').sort_values('membership')['gene_name'].tolist()))
    print()

Pattern 0
SLC25A4, ATP5A1, TPM1, TTN, ACTC1, MYH6

Pattern 1
MTRNR2L1, MT-ND1, MT-CO3, MT-CO2, MT-CYB, MT-CO1, MT-ND3, MT-ND4, MT-ATP6, MT-ND2

Pattern 2
ALMS1, KRT7, ITLN1

Pattern 3
MTRNR2L12, WISP2, MT-ND6, SDK1, RPL32, MGP, MT-ND5, MT-ND4L

Pattern 4
LDB3, NPM3

Pattern 5
HBA2, HBB

Pattern 6
GFRA3, C1R, C1S, CST3, CFD, DCN, FBLN1, FTL, C3

Pattern 7
CYCS, PTGDS, DES, TNNT2, NPPA, S100A1, MB, CHCHD10, GAPDH, ATP2A2, MYL7

These groups of genes can then be directly related to underlying spatial patterns.

In [245]:

figsize(13, 5.5)
for i, Ci in enumerate(histology_results.sort_values('pattern').pattern.unique()):
    C = patterns[Ci]
    plt.subplot(2, 4, i + 1)
    plt.scatter(sample_info.x, sample_info.y, c=C, s=25)
    plt.axis('equal');
    plt.gca().invert_yaxis();
    plt.title(f'Pattern {i}')
    
plt.tight_layout()

In [ ]: