Your new friends! MMseqs2 ultra fast and sensitive protein search, Linclust clustering huge protein sequence sets in linear time and Plass Protein-level assembly to increases protein sequence recovery from complex metagenomes.
Here you will learn the basic usage of MMseqs2 as well as expert tricks to take advantage of the ability of chaining different MMseqs2 modules to produce custom workflows.
We will use Conda to setup all required software for this tutorial. If you haven't setup Conda yet, please do so first and then execute:
conda create -n tutorial mmseqs2 plass megahit prodigal hmmer sra-tools
conda activate tutorial
The generic syntax for mmseqs
and plass
calls is always the
following:
mmseqs <command> <db1> [<db2> ...] --flag --parameter 42
The help text of mmseqs
shows, by default, only the most important
parameters and modules. To see a full list of parameters and modules use
the -h
flag.
If you are using Bash as your shell, you can activate tab-auto-completion of commands and parameters:
source $CONDA_PREFIX/util/bash-completion.sh
A 61-year-old man was admitted in December 2016 with bilateral headache, gait instability, lethargy, and confusion. Because of multiple tick bites in the preceding 2 weeks, he was prescribed the antibiotic doxycycline for presumed Lyme disease. Over the next 48 hours, he developed worsening confusion, weakness, and ataxia. He returned to the referring hospital and was admitted. He lived in a heavily wooded area in New Hampshire, had frequent tick exposures, and worked as a construction contractor in basements with uncertain rodent and bat exposures. His symptoms were diagnosed as Encephalitis and the causative agent --- not known.
disease.**
This pathogen is usually confirmed by a screening antibody test, followed by a plaque reduction neutralization test. However, this takes 5 weeks, which was too slow to affect the patient's care. As traditional tests done in the first week of the patient's hospital stay did not reveal any conclusive disease cause, the doctors were running out of options. Therefore a novel metagenomic analysis was performed.
Metagenomic sequencing from cerebrospinal fluid was performed on hospital day 8. It returned 14 million short nucleotide sequences (reads).
The authors of the study removed all human reads using Kraken (Wood and Salzberg 2014) and released a much smaller set of 226,908 reads on the SRA (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi). Kraken extracts short nucleotide subsequences of length k, also called k-mers, and compares them to a reference database where k-mers point to taxonomic labels. In case of exact matching it is able to assign taxonomy.
human reads. Why?**
many bases are publicly available on the SRA in total?**
We will use the sequence search tool MMseqs2 (Steinegger and Soeding 2017) to find the cause of this patient's disease. MMseqs2 translates the nucleotide reads to putative protein fragments, searches against a protein reference database and assigns taxonomic labels based on the found reference database hits.
viral pathogens? How does this compare with Kraken's approach?**
To not spoil the mystery to early, we prepared a FASTA file containing the reads. Download this file first:
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta
--2019-12-02 21:24:11-- http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta Resolving wwwuser.gwdg.de (wwwuser.gwdg.de)... 134.76.10.111 Connecting to wwwuser.gwdg.de (wwwuser.gwdg.de)|134.76.10.111|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 49930921 (48M) Saving to: ‘mystery_reads.fasta’ mystery_reads.fasta 100%[===================>] 47.62M 1.15MB/s in 16s 2019-12-02 21:24:28 (2.93 MB/s) - ‘mystery_reads.fasta’ saved [49930921/49930921]
The sequencing machine returns paired-end reads where sequencing starts in opposite directions from two close-by points to cover the same genomic region. Some of these paired reads overlap enough to be merged into a single read with FLASH (Magoc and Salzberg 2011).
We will also need a reference database of proteins. For this we will use the Swiss-Prot which is the manually curated, high-quality part of the Uniprot (Consortium 2014) protein reference database. We are using this smaller subset of about 500,000 proteins, since the full Uniprot with over 175,000,000 sequences requires too many computational resources. Each protein in Uniprot has a taxonomic label. Let us prepare this database:
Download the database:
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03.fasta.gz
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03_mapping.tsv.gz
!gunzip uniprot_sprot_2018_03_mapping.tsv.gz
Create a taxonomically annotated sequence database
!mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
uniprot_sprot exists and will be overwritten.
createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
MMseqs Version: 10.6d92c
Max sequence length 65535
Split seq. by length true
Database type 0
Do not shuffle input database true
Offset of numeric ids 0
Compressed 0
Verbosity 3
Converting sequences
[556915] 3s 18mss
Time for merging into uniprot_sprot_h by mergeResults: 0h 0m 0s 470ms
Time for merging into uniprot_sprot by mergeResults: 0h 0m 0s 583ms
Time for merging into uniprot_sprot.lookup by mergeResults: 0h 0m 0s 123ms
Time for processing: 0h 0m 5s 311ms
!mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv
createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv MMseqs Version: 10.6d92c NCBI tax dump directory Taxonomical mapping file uniprot_sprot_2018_03_mapping.tsv Threads 32 Verbosity 3 Download taxdump.tar.gz 2019-12-02 21:32:36 URL: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [50529876] -> "-" [1] Database created
Through a similarity search we will transfer the annotation of the reference protein to our unknown reads.
!mmseqs createdb mystery_reads.fasta reads
reads exists and will be overwritten. createdb mystery_reads.fasta reads MMseqs Version: 10.6d92c Max sequence length 65535 Split seq. by length true Database type 0 Do not shuffle input database true Offset of numeric ids 0 Compressed 0 Verbosity 3 Assuming DNA database, forcing parameter --dont-split-seq-by-len true Converting sequences [410667] 0s 692ms Time for merging into reads_h by mergeResults: 0h 0m 0s 275ms Time for merging into reads by mergeResults: 0h 0m 0s 297ms Time for merging into reads.lookup by mergeResults: 0h 0m 0s 94ms Time for processing: 0h 0m 2s 95ms
!mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2
taxonomy reads uniprot_sprot lca_result tmp -s 2 MMseqs Version: 10.6d92c Substitution matrix nucl:nucleotide.out,aa:blosum62.out Add backtrace false Alignment mode 2 E-value threshold 1 Seq. id. threshold 0 Min. alignment length 0 Seq. id. mode 0 Alternative alignments 0 Coverage threshold 0 Coverage mode 0 Max sequence length 65535 Compositional bias 1 Realign hits false Max reject 2147483647 Max accept 2147483647 Include identical seq. id. false Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Score bias 0 Gap open cost 11 Gap extension cost 1 Threads 32 Compressed 0 Verbosity 3 Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 2 K-mer size 0 K-score 2147483647 Alphabet size 21 Max results per query 300 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask lower case residues 0 Minimum diagonal score 15 Spaced k-mers 1 Spaced k-mer pattern Local temporary path Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Mask profile 1 Profile e-value threshold 0.001 Use global sequence weighting false Filter MSA 1 Maximum seq. id. threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 1000 Omit consensus false Min codons in orf 30 Max codons in length 32734 Max orf gaps 2147483647 Contig start mode 2 Contig end mode 2 Orf start mode 1 Forward frames 1,2,3 Reverse frames 1,2,3 Translation table 1 Translate orf 0 Use all table starts false Offset of numeric ids 0 Add orf stop false Chain overlapping alignments 0 Merge query 1 Search type 0 Number search iterations 1 Start sensitivity 4 Search steps 1 Run a seq-profile search in slice mode false Strand selection 1 Disk space limit 0 MPI runner Force restart with latest tmp false Remove temporary files false LCA ranks Blacklisted taxa 12908,28384 Show taxon lineage false LCA mode 4 Taxonomy output mode 0 Match sequences by their id. false lca_result.dbtype exists already!
MMseqs2 will create a result database in your current working directory.
This database consists of files, whose names start with lca_result
. We
can convert this database into a human readable tab separated values
file (TSV), a common format in bioinformatics:
!mmseqs createtsv reads lca_result lca.tsv
createtsv reads lca_result lca.tsv MMseqs Version: 10.6d92c First sequence as representative false Target column 1 Add full header false Sequence source 0 Database output false Threads 32 Compressed 0 Verbosity 3 Time for merging into lca.tsv by mergeResults: 0h 0m 0s 216ms Time for processing: 0h 0m 0s 527ms
In this file you see for every read a numeric taxonomic identifier, a
taxonomic rank and a taxonomic label. However, due to the large number
of reads, it is hard to gain insight by skimming the file. MMseqs2
offers a module to summarize the data into a single file report.txt
:
!mmseqs taxonomyreport uniprot_sprot lca_result report.txt
taxonomyreport uniprot_sprot lca_result report.txt MMseqs Version: 10.6d92c Threads 32 Compressed 0 Verbosity 3 Loading NCBI taxonomy Loading nodes file ... Done, got 2192766 nodes Loading merged file ... Done, added 55744 merged nodes. Loading names file ... Done Making matrix ... Done Init RMQ ...Done Reading LCA results [=================================================================] 100.00% 410.70K 0s 51ms Found 1297 different taxa for 410696 different reads. 385417 reads are unclassified. Calculating clade counts ... Done Time for processing: 0h 0m 7s 777ms
in the spinal fluid sample?**
MMseqs2 can also generate an interactive visualization of the data using Krona (Ondov, Bergman, and Phillippy 2011). Adapt the previous call to generate a Krona report:
!mmseqs taxonomyreport uniprot_sprot lca_result --report-mode 1 report.html
Usage: mmseqs taxonomyreport <i:targetDB> <i:taxDB> <o:taxonomyReport> [options]
Create Kraken-style taxonomy report.
By Florian Breitwieser <florian.bw@gmail.com>
Options:
Common:
--threads INT number of cores used for the computation (uses all cores by default) [32]
--compressed INT write results in compressed format [0]
-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]
Unrecognized parameter --report-mode
Did you mean "--threads"?
This generates a HTML
file that can be opened in a browser. This
offers an interactive circular visualization where you can click on each
label to zoom into different parts of the hierarchy.
Look up the following encephalitis causing agents in Wikipedia.
Borrelia bacterium
Herpes simplex virus
Powassan virus
West Nile virus
Mycoplasma
Angiostrongylus cantonensis
number, how would you determine if it is significant evidence for an actual presence of this agent?**
We now want to take a closer look only at the reads of the pathogen. To filter the result database, we will need the pathogen's numeric taxonomic identifier. Use the NCBI Taxonomy Browser to find it, by searching for its name.
more?**
Now we can call a different MMseqs2 module to retrieve only the reads that belong to this pathogen. Replace XXX with the number(s) you just found. If you found multiple, concatenate them with a comma character.
!mmseqs filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083
filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083 MMseqs Version: 10.6d92c Compressed 0 Selected taxons 11083 Loading NCBI taxonomy Loading nodes file ... Done, got 2192766 nodes Loading merged file ... Done, added 55744 merged nodes. Loading names file ... Done Making matrix ... Done Init RMQ ...Done Computing LCA [=================================================================] 100.00% 410.70K 0s 128ms ===========================================> ] 78.56% 322.66K eta 0s Time for merging into lca_only_pathogen by mergeResults: 0h 0m 0s 206ms Time for processing: 0h 0m 7s 863ms
We now get a list of all queries that were filtered out, meaning they were annotated as pathogenic.
!grep -Pv '\t1$' lca_only_pathogen.index > pathogenic_read_ids
With a few more commands we can convert our taxonomic labels back into a FASTA file:
!mmseqs createsubdb pathogenic_read_ids reads reads_pathogen
reads_pathogen exists and will be overwritten.
createsubdb pathogenic_read_ids reads reads_pathogen
MMseqs Version: 10.6d92c
Subdb mode 0
Verbosity 3
Time for merging into reads_pathogen by mergeResults: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 36ms
!mmseqs convert2fasta reads_pathogen reads_pathogen.fasta
Database reads_pathogen need header information.
The reads_pathogen_h is missing.
We want to try to recover the protein sequences of the pathogen.
Search the internet.**
We will use the protein assembly method Plass (Steinegger, Mirdita, and Söding 2019) to find overlapping reads and generate whole proteins out of the best matching ones.
!plass assemble reads_pathogen.fasta pathogen_assembly.fasta tmp
Program call: assemble reads_pathogen.fasta pathogen_assembly.fasta tmp MMseqs Version: 2.c7e35 Sub Matrix blosum62.out Rescore mode 0 Remove hits by seq.id. and coverage false E-value threshold 1e-05 Coverage threshold 0 Coverage Mode 0 Seq. Id Threshold 0.9 Seq. Id. Mode 0 Include identical Seq. Id. false Sort results 0 In substitution scoring mode, performs global alignment along the diagonal false No preload false Threads 32 Verbosity 3 Alphabet size 13 Kmer per sequence 60 Mask Residues 0 K-mer size 14 Max. sequence length 65535 Shift hash 5 Split Memory Limit 0 Include only extendable true Skip sequence with n repeating k-mers 8 Min codons in orf 45 Max codons in length 2147483647 Max orf gaps 2147483647 Contig start mode 2 Contig end mode 2 Orf start mode 0 Forward Frames 1,2,3 Reverse Frames 1,2,3 Translation Table 1 Use all table starts false Offset of numeric ids 0 Protein Filter Threshold 0.2 Filter Proteins 1 Number search iterations 12 Remove Temporary Files false Sets the MPI runner Program call: createdb reads_pathogen.fasta /home/chick/work/MMseqs2/tutorial/tmp/8597800890735853730/nucl_reads --max-seq-len 65535 --dont-split-seq-by-len 0 --dont-shuffle 1 --id-offset 0 -v 3 MMseqs Version: 2.c7e35 Max. sequence length 65535 Split Seq. by len false Do not shuffle input database true Offset of numeric ids 0 Verbosity 3 File reads_pathogen.fasta does not exist. Error: createdb failed
Take a look at the generated FASTA file pathogen_assembly.fasta
.
Plass will uncover a lot of variation in the reads and output many similar proteins. We can use the sequence clustering module in MMseqs2 to get only representative sequences.
!mmseqs easy-cluster pathogen_assembly.fasta assembly_clustered tmp
Input pathogen_assembly.fasta does not exist.
You will see three files starting with assembly_clustered
:
assembly_clustered_all_seqs.fasta
assembly_clustered_cluster.tsv
assembly_clustered_rep_seq.fasta
Take a look at the last one assembly_clustered_rep_seq.fasta
. This
file contains all representative sequences, meaning the sequence that
the algorithm chose as the most representative within this cluster.
We will look for known protein domains to identify the proteins we found. Instead of the MMseqs2 command line, we use the MMseqs2 webserver, which will visualize the results. Paste the content of the FASTA file containing the representative sequences into the webserver and make sure to select all three target databases (PFAM, PDB, Uniclust): https://search.mmseqs.com
Despite being able to identify the causative agent. The pathogen is very hard to treat. The patient had minimal neurological recovery and was discharged to an acute care facility on hospital day 30. Seven months after discharge, he was reportedly able to nod his head to questions and slightly move his upper extremities and toes.
You can find the publication about this patient and dataset here (Piantadosi et al. 2017). Please look at it only after trying to answer the questions yourself.
Consortium, UniProt. 2014. "UniProt: A Hub for Protein Information." Nucleic Acids Research 43 (D1): D204--D212.
Magoc, Tanja, and Steven L. Salzberg. 2011. "FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies." Bioinformatics 27 (21): 2957--63.
Ondov, Brian D, Nicholas H Bergman, and Adam M Phillippy. 2011. "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12 (1): 385.
Piantadosi, Anne, Sanjat Kanjilal, Vijay Ganesh, Arjun Khanna, Emily P Hyle, Jonathan Rosand, Tyler Bold, et al. 2017. "Rapid Detection of Powassan Virus in a Patient With Encephalitis by Metagenomic Sequencing." Clinical Infectious Diseases 66 (5): 789--92.
Steinegger, Martin, Milot Mirdita, and Johannes Söding. 2019. "Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold." Nature Methods 16 (7): 603--6.
Steinegger, Martin, and Johannes Soeding. 2017. "MMseqs2: Sensitive Protein Sequence Searching for the Analysis of Massive Data Sets." bioRxiv, 079681.
Wood, Derrick E, and Steven L Salzberg. 2014. "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biology 15 (3): R46.