Search ~500,000 metagenomes for matches to a genome sequence¶

The mastiff server is unpublished work by Luiz Irber - please file an issue if you are interested in how to cite it.

Python miscellany¶

Import all the Python libraries we need to use

In [1]:

import pandas as pd
import os
import sourmash, screed
import io
import urllib3

Download SRA metadata information.¶

This is a precomputed CSV that we download once per instance; it's used to summarize the ScientificName (i.e. metagenome type) of matches at the bottom.

In [2]:

!wget --no-clobber https://osf.io/download/762mk/ -O sra.runinfo.csv.gz

File ‘sra.runinfo.csv.gz’ already there; not retrieving.

Load the SRA metadata into a pandas dataframe¶

In [3]:

run_info = pd.read_csv('sra.runinfo.csv.gz')
print(f"Loaded {len(run_info)} records from sra.runinfo.csv.gz")
run_info.head()

Loaded 702013 records from sra.runinfo.csv.gz

Out[3]:

	Unnamed: 0	Run	ScientificName
0	0	SRR18036904	bovine metagenome
1	1	SRR18036905	bovine metagenome
2	2	SRR18036906	bovine metagenome
3	3	SRR18036907	bovine metagenome
4	4	SRR18036908	bovine metagenome

Generate a sourmash sketch for a query¶

In order to query mastiff with a genome, we need to sketch the genome using sourmash. Here use the Python interface for sourmash.

The below code loads all of the sequence data using the screed library, and feeds it into a sourmash FracMinHash sketch.

You can replace the QUERY_SEQUENCE_FILE with your own query filename, if you like. It should be over 10kb in size (ideally you want ~5-10 hashes to result in order for the query to be robust).

In [4]:

# generate sourmash sketch
QUERY_SEQUENCE_FILE='sequences/shewanella.fa.gz'

total_bp = 0

sketch = sourmash.MinHash(0, 21, scaled=1000)
with screed.open(QUERY_SEQUENCE_FILE) as records:
    for record in records:
        sketch.add_sequence(record.sequence, force=True)
        total_bp += len(record.sequence)
        
print(f"generated {len(sketch)} hashes by sketching {total_bp:g} bp from '{QUERY_SEQUENCE_FILE}'")

generated 5307 hashes by sketching 5.31291e+06 bp from 'sequences/shewanella.fa.gz'

Serialize the sourmash sketch¶

Now, we convert the sourmash sketch into gzipped JSON bytes.

In [5]:

# serialize to SourmashSignature / gzipped JSON
ss = sourmash.SourmashSignature(sketch)

buf = io.BytesIO()
sourmash.save_signatures([ss], buf, compression=True)
print(f"serialized sourmash signature into {len(buf.getvalue())} bytes.")

serialized sourmash signature into 44044 bytes.

Query the mastiff server¶

Next, use urllib3 to contact the mastiff server, send in the sourmash signature information, and get the results.

In [6]:

# query mastiff
http = urllib3.PoolManager()
r = http.request('POST',
                 'https://mastiff.sourmash.bio/search',
                 body = buf.getvalue(),
                 headers = { 'Content-Type': 'application/json'})

query_results_text = r.data.decode('utf-8')

Parse the mastiff results as CSV¶

mastiff returns the results in CSV format, with two columns - the SRA accession, and the containment index of the query.

The containment index is calculated as the number of hashes in the query that are present in a matching metagenome, divided by the total number of hashes in the query. A containment of 1.0 means every hash in the query sketch is present, and estimates that the original query sequence is entirely present; a containment of 0.5 means that 50% of the query sketch is present, suggesting that only about half of the query sequence is actually in the metagenome.

In [7]:

# load results into a pandas data frame
results_wrap_fp = io.StringIO(query_results_text)
mastiff0_df = pd.read_csv(results_wrap_fp)

print(f"Loaded {len(mastiff0_df)} mastiff results into a dataframe!")

Loaded 5702 mastiff results into a dataframe!

Explore the results¶

Look at the best and worst matches.

In [8]:

mastiff0_df.head()

Out[8]:

	SRA accession	containment
0	ERR1254310	1.000000
1	ERR1254284	1.000000
2	ERR1254285	0.999812
3	ERR1254311	0.999623
4	SRR606249	0.999246

In [9]:

mastiff0_df.tail()

Out[9]:

	SRA accession	containment
5697	SRR5918872	0.009422
5698	SRR5207236	0.009422
5699	SRR5499129	0.009422
5700	SRR8500488	0.009422
5701	ERR599067	0.009422

Discard matches below a containment of 0.2¶

Per our work on containment ANI, this corresponds to a minimum of 92% average nucleotide identity.

In [10]:

THRESHOLD=0.2

mastiff_df = mastiff0_df[mastiff0_df['containment'] >= THRESHOLD]
print(f"Filtering results at a containment of >= {THRESHOLD:.2f}; {len(mastiff_df)} of {len(mastiff0_df)} left.")

Filtering results at a containment of >= 0.20; 312 of 5702 left.

Explore metagenome "types" through `Scientific Name`¶

Now, correlate results with the ScientificName metadata, which records which "kind" of metagenome this is.

In [11]:

mastiff2_df = mastiff_df.set_index('SRA accession').join(run_info.set_index('Run')['ScientificName'])
mastiff2_df.head()

Out[11]:

	containment	ScientificName
SRA accession
ERR1254310	1.000000	metagenome
ERR1254284	1.000000	metagenome
ERR1254285	0.999812	metagenome
ERR1254311	0.999623	metagenome
SRR606249	0.999246	synthetic metagenome

In [12]:

# how many have 'null' scientific name?
null_df = mastiff2_df[mastiff2_df['ScientificName'].isnull()]
print(len(null_df))

In [13]:

mastiff3_df = mastiff2_df[~mastiff2_df['ScientificName'].isnull()]
print(f"Of {len(mastiff2_df)} MAGsearch results, {len(mastiff3_df)} have non-null ScientificName")
mastiff3_df.head()

Of 312 MAGsearch results, 303 have non-null ScientificName

Out[13]:

	containment	ScientificName
SRA accession
ERR1254310	1.000000	metagenome
ERR1254284	1.000000	metagenome
ERR1254285	0.999812	metagenome
ERR1254311	0.999623	metagenome
SRR606249	0.999246	synthetic metagenome

Group and aggregate by metagenome type¶

Explore which types of metagenomes tend to contain this query.

In [14]:

# what are the top ScientificNames of the matches?
mastiff3_df["ScientificName"].value_counts()[:20]

Out[14]:

metagenome                  111
freshwater metagenome        32
wastewater metagenome        30
oyster metagenome            22
aquifer metagenome           17
marine metagenome            11
food metagenome              11
aquatic metagenome            9
biofilm metagenome            9
seawater metagenome           9
groundwater metagenome        6
synthetic metagenome          5
bioreactor metagenome         3
human skin metagenome         3
soil metagenome               2
money metagenome              2
fish metagenome               2
fish gut metagenome           2
oral metagenome               2
aquatic viral metagenome      2
Name: ScientificName, dtype: int64