Since our paper on SpatialDE was published today in Nature Methods, I wanted to demonstrate how to use our method on a dataset which came out while the paper was under review.
I spent a few hours this afternoon (March 19), doing this analysis. The data is from Asp et al published in Scientific Reports in October last year. The authors use the Spatial Transcriptomics method to investigate spatial gene expression in cardiac biopsies.
%pylab inline
import pandas as pd
Populating the interactive namespace from numpy and matplotlib
According to Table 1 of Asp et al , individual 3 right atrial appandage is covering 543 "spots", being one of the larger datasets. This data is available on the ST website, I downloaded both tables for Individual3_RAA
.
long_table = pd.read_csv('Individual3_RAA_R1.txt', sep='\t')
long_table.head()
# gene_name | barcode_id | x | y | reads_count | |
---|---|---|---|---|---|
0 | ENSG00000103512 | CCTCGAACTCGAACGTTG | 10 | 5 | 1 |
1 | ENSG00000172209 | CCTCGAACTCGAACGTTG | 10 | 5 | 1 |
2 | ENSG00000138663 | CCTCGAACTCGAACGTTG | 10 | 5 | 1 |
3 | ENSG00000132541 | CCTCGAACTCGAACGTTG | 10 | 5 | 1 |
4 | ENSG00000114126 | CCTCGAACTCGAACGTTG | 10 | 5 | 1 |
This table is in "long" format, where each row is a combination of gene, x, y, and gene count. This is generally a pretty handy format, but for our analysis, it will be simpler to work on a wide expression matrix.
counts = long_table \
.pivot(index='barcode_id', columns='# gene_name', values='reads_count') \
.fillna(0) \
.astype(int)
counts.iloc[:10, :5]
# gene_name | ENSG00000000003 | ENSG00000000005 | ENSG00000000419 | ENSG00000000457 | ENSG00000000460 |
---|---|---|---|---|---|
barcode_id | |||||
ACAAAGAGATATAGCCCT | 0 | 0 | 0 | 0 | 0 |
ACAACTATGGGTTGGCGG | 0 | 0 | 0 | 0 | 0 |
ACAAGCCGCGCTGTTACG | 0 | 0 | 0 | 0 | 0 |
ACACAGATCCTGTTCTGA | 0 | 0 | 1 | 0 | 0 |
ACACTACGTCCTGTCGAA | 0 | 0 | 0 | 0 | 0 |
ACATAGACGCCGTCTCGG | 0 | 0 | 0 | 0 | 0 |
ACATCACCTGCGCGCTCT | 0 | 0 | 0 | 0 | 0 |
ACATGTATGGCCGCGCCC | 0 | 0 | 0 | 0 | 0 |
ACATTAACCTGATAACAT | 0 | 0 | 0 | 0 | 0 |
ACATTTAAGGCGCATGAT | 0 | 0 | 0 | 0 | 0 |
counts.shape
(534, 14980)
To keep track of spatial location of each sample, we create a sample information table.
sample_info = long_table.groupby('barcode_id')['x', 'y'].first()
# Just to make sure ecerything is sorted the same
sample_info = sample_info.sort_index()
counts = counts.sort_index()
figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c='k', s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');
sample_info['total_count'] = counts.sum(1)
# Filter practically unobserved genes, and remove practically empty "spots"
counts = counts.T[counts.sum(0) >= 3].T
sample_info = sample_info.query('total_count > 5').copy()
counts = counts.loc[sample_info.index]
counts.shape
(530, 12150)
After filtering, there are 530 spatial "spots", where 12,150 genes have been measured.
figsize(5, 5)
plt.scatter(sample_info.x, sample_info.y, c=np.log10(sample_info.total_count), s=50)
plt.gca().invert_yaxis();
plt.axis('equal');
plt.xlabel('Spatial x-axis')
plt.ylabel('Spatial y-axis');
Before running SpatialDE
we need to do a couple of things: first variance stabilize the counts, then control for the variation in count depth for the "spots".
import NaiveDE
# Apply Anscombe's transform for NB data
dfm = NaiveDE.stabilize(counts.T).T
# Extract residuals from regression
resids = NaiveDE.regress_out(sample_info, dfm.T, 'np.log(total_count)').T
# Put spatial coordinates in their own table
X = sample_info[['x', 'y']]
Now we run SpatialDE, this takes about 3 minutes on my iMac
%%time
import SpatialDE
results = SpatialDE.run(X, resids)
INFO:root:Performing DE test INFO:root:Pre-calculating USU^T = K's ... INFO:root:Done: 0.34s INFO:root:Fitting gene models INFO:root:Model 1 of 10 INFO:root:Model 2 of 10 INFO:root:Model 3 of 10 INFO:root:Model 4 of 10 INFO:root:Model 5 of 10 INFO:root:Model 6 of 10 INFO:root:Model 7 of 10 INFO:root:Model 8 of 10 INFO:root:Model 9 of 10 INFO:root:Model 10 of 10
CPU times: user 2min 40s, sys: 3.15 s, total: 2min 43s Wall time: 2min 38s
results.head(3).T
0 | 1 | 2 | |
---|---|---|---|
FSV | 2.05897e-09 | 2.05897e-09 | 2.05897e-09 |
M | 4 | 4 | 4 |
g | ENSG00000000419 | ENSG00000000938 | ENSG00000001036 |
l | 0.5 | 0.5 | 0.5 |
max_delta | 4.85165e+08 | 4.85165e+08 | 4.85165e+08 |
max_ll | 179.568 | 881.869 | 327.966 |
max_mu_hat | 0.129841 | 0.350817 | 0.186056 |
max_s2_t_hat | 9.60333e-11 | 2.58001e-10 | 1.06357e-10 |
model | SE | SE | SE |
n | 530 | 530 | 530 |
s2_FSV | 0.759402 | 0.611944 | 0.0295725 |
s2_logdelta | 1.29702e+17 | 1.04517e+17 | 5.05083e+15 |
time | 0.000999928 | 0.000757933 | 0.000724792 |
BIC | -334.044 | -1738.65 | -630.841 |
max_ll_null | 179.567 | 881.868 | 327.966 |
LLR | 0.00047229 | 0.000472289 | 0.000472279 |
pval | 0.982662 | 0.982662 | 0.982662 |
qval | 0.982662 | 0.982662 | 0.982662 |
results.shape
(12150, 18)
import SpatialDE.plot
# For numerical reasons, sometimes p-values are set to 0 of they are too small.
# Bring these back to reasonable numbers.
results['pval'] = results['pval'].clip_lower(results.query('pval > 0')['pval'].min() / 2)
results['qval'] = results['qval'].clip_lower(results.query('qval > 0')['qval'].min() / 2)
SpatialDE.plot.FSV_sig(results)
sig_res = results.query('qval < 0.05')
sig_res.shape
(73, 18)
This means 73 genes are significantly spatially variable. We would like to investigate these, so we need to translate the ENSEMBL IDs in the table to gene names that are easier for people to parse. For this, I went to http://uswest.ensembl.org/biomart and downloaded a table of gene ids and gene names for the human gene annotation. I don't know if this is the annotation Asp et al used, but in most cases the IDs haven't changed in a long time.
gene_info = pd.read_csv('gene_ids.tsv', index_col=0, sep='\t')
gene_info.head()
Gene name | Gene description | |
---|---|---|
Gene stable ID | ||
ENSG00000283891 | MIR628 | microRNA 628 [Source:HGNC Symbol;Acc:HGNC:32884] |
ENSG00000251931 | RNU6-871P | RNA, U6 small nuclear 871, pseudogene [Source:... |
ENSG00000207766 | MIR626 | microRNA 626 [Source:HGNC Symbol;Acc:HGNC:32882] |
ENSG00000275323 | AC012314.7 | 40S ribosomal protein S9 [Source:UniProtKB/Sw... |
ENSG00000276678 | GHRLOS | Ghrelin opposite strand RNA conserved region [... |
results['gene_name'] = results.g.map(gene_info['Gene name'])
sig_res = results.query('qval < 0.05')
What is the general distribution of length scales of significant genes?
sig_res.l.value_counts()
0.500000 22 4.178220 21 7.103897 17 2.457457 5 0.850110 4 1.445375 3 12.078193 1 Name: l, dtype: int64
For the ST technology used in Asp et al, the length unit for spatial coordinates is 200 µm, so in real-world coordinates, a lengthscale of 0.5 corresponds to 100 µm, while 4.17 corresponds to 834 µm. These are the roughyl scales at which expression levels change in a meaningful way. Now let's see which genes are represented at these scales. We can probably ignore the results for the smallest lengthscale, since it is so much smaller than smallest observed distance, it is likely to indicate a couple of neighboring spots expressing the gene.
grp = sig_res.groupby('l')
for l, group in grp:
top_string = ', '.join(group.sort_values('pval').head(10)['gene_name'].values)
if group.shape[0] > 10:
top_string += ', ...'
print(f'l = {l}:')
print(top_string)
print()
l = 0.5: PABPC3, CDH23, AMY2A, LARP1B, ZNF117, LRRC4B, PRR15, ZDHHC11, PTPRD, LYSMD1, ... l = 0.8501104046527316: ALMS1, SDK1, NPM3, GFRA3 l = 1.445375400197662: HBA2, HBB, TPM1 l = 2.4574573326742764: NPPA, MT-CO3, CST3, CFD, DES l = 4.178220094993103: DCN, FBLN1, MT-ATP6, MT-ND4, MT-CO1, MT-CO2, MB, PTGDS, MT-CYB, TNNT2, ... l = 7.10389675136552: MT-ND2, MT-ND6, MTRNR2L1, MYH6, GAPDH, MT-ND5, ITLN1, ACTC1, CYCS, C1S, ... l = 12.078193083829134: MYL7
At this point it is usually fun to google all these genes, and plot what they look like spatially. I leave that as an exercise to the reader, but plot the top gene at each lengthscale so you can get a feeling for how the different lengthscales look.
figsize(10, 6)
top_list = ['PABPC3', 'ALMS1', 'HBA2', 'NPPA', 'DCN', 'MT-ND2', 'MYL7']
for i, g in enumerate(top_list):
plt.subplot(2, 4, i + 1)
g_id = results.query('gene_name == @g').g.iloc[0]
l = results.query('gene_name == @g').l.iloc[0]
plt.scatter(sample_info.x, sample_info.y, c=resids[g_id], s=25)
plt.axis('equal');
plt.gca().invert_yaxis();
plt.title(f'l = {l:.2}: {g}')
plt.tight_layout()
Now let's investigate what histological patterns in the heart tissue is defined by these spatially variable genes.
I'm excluding the genes with very small lengthscales, because they seem independent from eachother. I'm picking 8 patterns, and a lengthscale of 2.0 which should be compatible with the l=1.4 genes at the same time as using the spatial prior for genes with larger l.
%%time
histology_results, patterns = SpatialDE.spatial_patterns(X, res, sig_res.query('l > 0.8'), 8, 2.0, verbosity=1)
iter 0, ELBO: -2.60e+10 iter 1, ELBO: -1.31e+10, delta_ELBO: 1.29e+10 iter 2, ELBO: -1.31e+10, delta_ELBO: 2.89e+03 iter 3, ELBO: -1.31e+10, delta_ELBO: 1.60e+03 iter 4, ELBO: -1.31e+10, delta_ELBO: 5.32e+02 iter 5, ELBO: -1.31e+10, delta_ELBO: 6.26e+01 iter 6, ELBO: -1.31e+10, delta_ELBO: 8.26e+01 iter 7, ELBO: -1.31e+10, delta_ELBO: 1.22e+00 iter 8, ELBO: -1.31e+10, delta_ELBO: 1.69e-02 iter 9, ELBO: -1.31e+10, delta_ELBO: 2.44e-04 Converged on iter 9 CPU times: user 26.1 s, sys: 1.64 s, total: 27.7 s Wall time: 14.2 s
histology_results.pattern.value_counts()
7 11 1 10 6 9 3 8 0 6 2 3 5 2 4 2 Name: pattern, dtype: int64
# Again put in interpretable gene names
histology_results['gene_name'] = histology_results.g.map(gene_info['Gene name'])
for i in histology_results.sort_values('pattern').pattern.unique():
print('Pattern {}'.format(i))
print(', '.join(histology_results.query('pattern == @i').sort_values('membership')['gene_name'].tolist()))
print()
Pattern 0 SLC25A4, ATP5A1, TPM1, TTN, ACTC1, MYH6 Pattern 1 MTRNR2L1, MT-ND1, MT-CO3, MT-CO2, MT-CYB, MT-CO1, MT-ND3, MT-ND4, MT-ATP6, MT-ND2 Pattern 2 ALMS1, KRT7, ITLN1 Pattern 3 MTRNR2L12, WISP2, MT-ND6, SDK1, RPL32, MGP, MT-ND5, MT-ND4L Pattern 4 LDB3, NPM3 Pattern 5 HBA2, HBB Pattern 6 GFRA3, C1R, C1S, CST3, CFD, DCN, FBLN1, FTL, C3 Pattern 7 CYCS, PTGDS, DES, TNNT2, NPPA, S100A1, MB, CHCHD10, GAPDH, ATP2A2, MYL7
These groups of genes can then be directly related to underlying spatial patterns.
figsize(13, 5.5)
for i, Ci in enumerate(histology_results.sort_values('pattern').pattern.unique()):
C = patterns[Ci]
plt.subplot(2, 4, i + 1)
plt.scatter(sample_info.x, sample_info.y, c=C, s=25)
plt.axis('equal');
plt.gca().invert_yaxis();
plt.title(f'Pattern {i}')
plt.tight_layout()