See the GitHub repo for more info!
The mastiff server is unpublished work by Luiz Irber - please file an issue if you are interested in how to cite it.
Import all the Python libraries we need to use
import pandas as pd
import os
import sourmash, screed
import io
import urllib3
This is a precomputed CSV that we download once per instance; it's used to summarize the ScientificName
(i.e. metagenome type) of matches at the bottom.
!wget --no-clobber https://osf.io/download/762mk/ -O sra.runinfo.csv.gz
File ‘sra.runinfo.csv.gz’ already there; not retrieving.
run_info = pd.read_csv('sra.runinfo.csv.gz')
print(f"Loaded {len(run_info)} records from sra.runinfo.csv.gz")
run_info.head()
Loaded 702013 records from sra.runinfo.csv.gz
Unnamed: 0 | Run | ScientificName | |
---|---|---|---|
0 | 0 | SRR18036904 | bovine metagenome |
1 | 1 | SRR18036905 | bovine metagenome |
2 | 2 | SRR18036906 | bovine metagenome |
3 | 3 | SRR18036907 | bovine metagenome |
4 | 4 | SRR18036908 | bovine metagenome |
In order to query mastiff with a genome, we need to sketch the genome using sourmash. Here use the Python interface for sourmash.
The below code loads all of the sequence data using the screed library, and feeds it into a sourmash FracMinHash sketch.
You can replace the QUERY_SEQUENCE_FILE
with your own query filename, if you like. It should be over 10kb in size (ideally you want ~5-10 hashes to result in order for the query to be robust).
# generate sourmash sketch
QUERY_SEQUENCE_FILE='sequences/shewanella.fa.gz'
total_bp = 0
sketch = sourmash.MinHash(0, 21, scaled=1000)
with screed.open(QUERY_SEQUENCE_FILE) as records:
for record in records:
sketch.add_sequence(record.sequence, force=True)
total_bp += len(record.sequence)
print(f"generated {len(sketch)} hashes by sketching {total_bp:g} bp from '{QUERY_SEQUENCE_FILE}'")
generated 5307 hashes by sketching 5.31291e+06 bp from 'sequences/shewanella.fa.gz'
Now, we convert the sourmash sketch into gzipped JSON bytes.
# serialize to SourmashSignature / gzipped JSON
ss = sourmash.SourmashSignature(sketch)
buf = io.BytesIO()
sourmash.save_signatures([ss], buf, compression=True)
print(f"serialized sourmash signature into {len(buf.getvalue())} bytes.")
serialized sourmash signature into 44044 bytes.
Next, use urllib3
to contact the mastiff server, send in the sourmash signature information, and get the results.
# query mastiff
http = urllib3.PoolManager()
r = http.request('POST',
'https://mastiff.sourmash.bio/search',
body = buf.getvalue(),
headers = { 'Content-Type': 'application/json'})
query_results_text = r.data.decode('utf-8')
mastiff returns the results in CSV format, with two columns - the SRA accession, and the containment index of the query.
The containment index is calculated as the number of hashes in the query that are present in a matching metagenome, divided by the total number of hashes in the query. A containment of 1.0 means every hash in the query sketch is present, and estimates that the original query sequence is entirely present; a containment of 0.5 means that 50% of the query sketch is present, suggesting that only about half of the query sequence is actually in the metagenome.
# load results into a pandas data frame
results_wrap_fp = io.StringIO(query_results_text)
mastiff0_df = pd.read_csv(results_wrap_fp)
print(f"Loaded {len(mastiff0_df)} mastiff results into a dataframe!")
Loaded 5702 mastiff results into a dataframe!
Look at the best and worst matches.
mastiff0_df.head()
SRA accession | containment | |
---|---|---|
0 | ERR1254310 | 1.000000 |
1 | ERR1254284 | 1.000000 |
2 | ERR1254285 | 0.999812 |
3 | ERR1254311 | 0.999623 |
4 | SRR606249 | 0.999246 |
mastiff0_df.tail()
SRA accession | containment | |
---|---|---|
5697 | SRR5918872 | 0.009422 |
5698 | SRR5207236 | 0.009422 |
5699 | SRR5499129 | 0.009422 |
5700 | SRR8500488 | 0.009422 |
5701 | ERR599067 | 0.009422 |
Per our work on containment ANI, this corresponds to a minimum of 92% average nucleotide identity.
THRESHOLD=0.2
mastiff_df = mastiff0_df[mastiff0_df['containment'] >= THRESHOLD]
print(f"Filtering results at a containment of >= {THRESHOLD:.2f}; {len(mastiff_df)} of {len(mastiff0_df)} left.")
Filtering results at a containment of >= 0.20; 312 of 5702 left.
Scientific Name
¶Now, correlate results with the ScientificName
metadata, which records which "kind" of metagenome this is.
mastiff2_df = mastiff_df.set_index('SRA accession').join(run_info.set_index('Run')['ScientificName'])
mastiff2_df.head()
containment | ScientificName | |
---|---|---|
SRA accession | ||
ERR1254310 | 1.000000 | metagenome |
ERR1254284 | 1.000000 | metagenome |
ERR1254285 | 0.999812 | metagenome |
ERR1254311 | 0.999623 | metagenome |
SRR606249 | 0.999246 | synthetic metagenome |
# how many have 'null' scientific name?
null_df = mastiff2_df[mastiff2_df['ScientificName'].isnull()]
print(len(null_df))
9
mastiff3_df = mastiff2_df[~mastiff2_df['ScientificName'].isnull()]
print(f"Of {len(mastiff2_df)} MAGsearch results, {len(mastiff3_df)} have non-null ScientificName")
mastiff3_df.head()
Of 312 MAGsearch results, 303 have non-null ScientificName
containment | ScientificName | |
---|---|---|
SRA accession | ||
ERR1254310 | 1.000000 | metagenome |
ERR1254284 | 1.000000 | metagenome |
ERR1254285 | 0.999812 | metagenome |
ERR1254311 | 0.999623 | metagenome |
SRR606249 | 0.999246 | synthetic metagenome |
Explore which types of metagenomes tend to contain this query.
# what are the top ScientificNames of the matches?
mastiff3_df["ScientificName"].value_counts()[:20]
metagenome 111 freshwater metagenome 32 wastewater metagenome 30 oyster metagenome 22 aquifer metagenome 17 marine metagenome 11 food metagenome 11 aquatic metagenome 9 biofilm metagenome 9 seawater metagenome 9 groundwater metagenome 6 synthetic metagenome 5 bioreactor metagenome 3 human skin metagenome 3 soil metagenome 2 money metagenome 2 fish metagenome 2 fish gut metagenome 2 oral metagenome 2 aquatic viral metagenome 2 Name: ScientificName, dtype: int64