Table of Contents

Introduction

MMseqs2, Linclust and Plass

Your new friends! MMseqs2 ultra fast and sensitive protein search, Linclust clustering huge protein sequence sets in linear time and Plass Protein-level assembly to increases protein sequence recovery from complex metagenomes.

Here you will learn the basic usage of MMseqs2 as well as expert tricks to take advantage of the ability of chaining different MMseqs2 modules to produce custom workflows.

Required software for the tutorials

We will use Conda to setup all required software for this tutorial. If you haven't setup Conda yet, please do so first and then execute:

conda create -n tutorial mmseqs2 plass megahit prodigal hmmer sra-tools
conda activate tutorial

The generic syntax for mmseqs and plass calls is always the following:

mmseqs <command> <db1> [<db2> ...] --flag --parameter 42

The help text of mmseqs shows, by default, only the most important parameters and modules. To see a full list of parameters and modules use the -h flag.

If you are using Bash as your shell, you can activate tab-auto-completion of commands and parameters:

source $CONDA_PREFIX/util/bash-completion.sh

Metagenomic Pathogen Detection

The Patient

A 61-year-old man was admitted in December 2016 with bilateral headache, gait instability, lethargy, and confusion. Because of multiple tick bites in the preceding 2 weeks, he was prescribed the antibiotic doxycycline for presumed Lyme disease. Over the next 48 hours, he developed worsening confusion, weakness, and ataxia. He returned to the referring hospital and was admitted. He lived in a heavily wooded area in New Hampshire, had frequent tick exposures, and worked as a construction contractor in basements with uncertain rodent and bat exposures. His symptoms were diagnosed as Encephalitis and the causative agent --- not known.

  • Your task will be to identify the pathogenic root cause of the disease.

This pathogen is usually confirmed by a screening antibody test, followed by a plaque reduction neutralization test. However, this takes 5 weeks, which was too slow to affect the patient's care. As traditional tests done in the first week of the patient's hospital stay did not reveal any conclusive disease cause, the doctors were running out of options. Therefore a novel metagenomic analysis was performed.

The Dataset

Metagenomic sequencing from cerebrospinal fluid was performed on hospital day 8. It returned 14 million short nucleotide sequences (reads).

The authors of the study removed all human reads using Kraken (Wood and Salzberg 2014) and released a much smaller set of 226,908 reads on the SRA (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi). Kraken extracts short nucleotide subsequences of length k, also called k-mers, and compares them to a reference database where k-mers point to taxonomic labels. In case of exact matching it is able to assign taxonomy.

  • Why didn't the authors release the complete dataset?
  • Demanding exact k-mer matching in Kraken has benefits for removing human reads. Why?
  • What is the SRA? How many samples are there in the SRA currently? How many bases are publicly available on the SRA in total?

Metagenomic pathogen detection using MMseqs2

We will use the sequence search tool MMseqs2 (Steinegger and Soeding 2017) to find the cause of this patient's disease. MMseqs2 translates the nucleotide reads to putative protein fragments, searches against a protein reference database and assigns taxonomic labels based on the found reference database hits.

  • Why might a protein-protein search be useful for finding bacterial or viral pathogens? How does this compare with Kraken's approach?

Assigning taxonomic labels

To not spoil the mystery to early, we prepared a FASTA file containing the reads. Download this file first:

In [3]:
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta
--2019-12-02 21:24:11--  http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/mystery_reads.fasta
Resolving wwwuser.gwdg.de (wwwuser.gwdg.de)... 134.76.10.111
Connecting to wwwuser.gwdg.de (wwwuser.gwdg.de)|134.76.10.111|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49930921 (48M)
Saving to: ‘mystery_reads.fasta’

mystery_reads.fasta 100%[===================>]  47.62M  1.15MB/s    in 16s     

2019-12-02 21:24:28 (2.93 MB/s) - ‘mystery_reads.fasta’ saved [49930921/49930921]

The sequencing machine returns paired-end reads where sequencing starts in opposite directions from two close-by points to cover the same genomic region. Some of these paired reads overlap enough to be merged into a single read with FLASH (Magoc and Salzberg 2011).

We will also need a reference database of proteins. For this we will use the Swiss-Prot which is the manually curated, high-quality part of the Uniprot (Consortium 2014) protein reference database. We are using this smaller subset of about 500,000 proteins, since the full Uniprot with over 175,000,000 sequences requires too many computational resources. Each protein in Uniprot has a taxonomic label. Let us prepare this database:

Download the database:

In [1]:
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03.fasta.gz
!wget -c http://wwwuser.gwdg.de/~compbiol/mmseqs2/tutorials/uniprot_sprot_2018_03_mapping.tsv.gz
!gunzip uniprot_sprot_2018_03_mapping.tsv.gz

Create a taxonomically annotated sequence database

In [7]:
!mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
uniprot_sprot exists and will be overwritten.
createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot 

MMseqs Version:              	10.6d92c
Max sequence length          	65535
Split seq. by length         	true
Database type                	0
Do not shuffle input database	true
Offset of numeric ids        	0
Compressed                   	0
Verbosity                    	3

Converting sequences
[556915] 3s 18mss
Time for merging into uniprot_sprot_h by mergeResults: 0h 0m 0s 470ms
Time for merging into uniprot_sprot by mergeResults: 0h 0m 0s 583ms
Time for merging into uniprot_sprot.lookup by mergeResults: 0h 0m 0s 123ms
Time for processing: 0h 0m 5s 311ms
In [8]:
!mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv
createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv 

MMseqs Version:         	10.6d92c
NCBI tax dump directory 	
Taxonomical mapping file	uniprot_sprot_2018_03_mapping.tsv
Threads                 	32
Verbosity               	3

Download taxdump.tar.gz
2019-12-02 21:32:36 URL: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz [50529876] -> "-" [1]
Database created

Through a similarity search we will transfer the annotation of the reference protein to our unknown reads.

In [9]:
!mmseqs createdb mystery_reads.fasta reads
reads exists and will be overwritten.
createdb mystery_reads.fasta reads 

MMseqs Version:              	10.6d92c
Max sequence length          	65535
Split seq. by length         	true
Database type                	0
Do not shuffle input database	true
Offset of numeric ids        	0
Compressed                   	0
Verbosity                    	3

Assuming DNA database, forcing parameter --dont-split-seq-by-len true
Converting sequences
[410667] 0s 692ms
Time for merging into reads_h by mergeResults: 0h 0m 0s 275ms
Time for merging into reads by mergeResults: 0h 0m 0s 297ms
Time for merging into reads.lookup by mergeResults: 0h 0m 0s 94ms
Time for processing: 0h 0m 2s 95ms
In [10]:
!mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2
taxonomy reads uniprot_sprot lca_result tmp -s 2 

MMseqs Version:                       	10.6d92c
Substitution matrix                   	nucl:nucleotide.out,aa:blosum62.out
Add backtrace                         	false
Alignment mode                        	2
E-value threshold                     	1
Seq. id. threshold                    	0
Min. alignment length                 	0
Seq. id. mode                         	0
Alternative alignments                	0
Coverage threshold                    	0
Coverage mode                         	0
Max sequence length                   	65535
Compositional bias                    	1
Realign hits                          	false
Max reject                            	2147483647
Max accept                            	2147483647
Include identical seq. id.            	false
Preload mode                          	0
Pseudo count a                        	1
Pseudo count b                        	1.5
Score bias                            	0
Gap open cost                         	11
Gap extension cost                    	1
Threads                               	32
Compressed                            	0
Verbosity                             	3
Seed substitution matrix              	nucl:nucleotide.out,aa:VTML80.out
Sensitivity                           	2
K-mer size                            	0
K-score                               	2147483647
Alphabet size                         	21
Max results per query                 	300
Split database                        	0
Split mode                            	2
Split memory limit                    	0
Diagonal scoring                      	true
Exact k-mer matching                  	0
Mask residues                         	1
Mask lower case residues              	0
Minimum diagonal score                	15
Spaced k-mers                         	1
Spaced k-mer pattern                  	
Local temporary path                  	
Rescore mode                          	0
Remove hits by seq. id. and coverage  	false
Sort results                          	0
Mask profile                          	1
Profile e-value threshold             	0.001
Use global sequence weighting         	false
Filter MSA                            	1
Maximum seq. id. threshold            	0.9
Minimum seq. id.                      	0
Minimum score per column              	-20
Minimum coverage                      	0
Select N most diverse seqs            	1000
Omit consensus                        	false
Min codons in orf                     	30
Max codons in length                  	32734
Max orf gaps                          	2147483647
Contig start mode                     	2
Contig end mode                       	2
Orf start mode                        	1
Forward frames                        	1,2,3
Reverse frames                        	1,2,3
Translation table                     	1
Translate orf                         	0
Use all table starts                  	false
Offset of numeric ids                 	0
Add orf stop                          	false
Chain overlapping alignments          	0
Merge query                           	1
Search type                           	0
Number search iterations              	1
Start sensitivity                     	4
Search steps                          	1
Run a seq-profile search in slice mode	false
Strand selection                      	1
Disk space limit                      	0
MPI runner                            	
Force restart with latest tmp         	false
Remove temporary files                	false
LCA ranks                             	
Blacklisted taxa                      	12908,28384
Show taxon lineage                    	false
LCA mode                              	4
Taxonomy output mode                  	0
Match sequences by their id.          	false

lca_result.dbtype exists already!

MMseqs2 will create a result database in your current working directory. This database consists of files, whose names start with lca_result. We can convert this database into a human readable tab separated values file (TSV), a common format in bioinformatics:

In [11]:
!mmseqs createtsv reads lca_result lca.tsv
createtsv reads lca_result lca.tsv 

MMseqs Version:                 	10.6d92c
First sequence as representative	false
Target column                   	1
Add full header                 	false
Sequence source                 	0
Database output                 	false
Threads                         	32
Compressed                      	0
Verbosity                       	3

Time for merging into lca.tsv by mergeResults: 0h 0m 0s 216ms
Time for processing: 0h 0m 0s 527ms

In this file you see for every read a numeric taxonomic identifier, a taxonomic rank and a taxonomic label. However, due to the large number of reads, it is hard to gain insight by skimming the file. MMseqs2 offers a module to summarize the data into a single file report.txt:

In [12]:
!mmseqs taxonomyreport uniprot_sprot lca_result report.txt
taxonomyreport uniprot_sprot lca_result report.txt 

MMseqs Version:	10.6d92c
Threads   	32
Compressed	0
Verbosity 	3

Loading NCBI taxonomy
Loading nodes file ... Done, got 2192766 nodes
Loading merged file ... Done, added 55744 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done
Reading LCA results
[=================================================================] 100.00% 410.70K 0s 51ms     

Found 1297 different taxa for 410696 different reads.
385417 reads are unclassified.
Calculating clade counts ...  Done
Time for processing: 0h 0m 7s 777ms
  • What is the most common species in this dataset?
  • Why are there so many different eukaryotic sequences? Were they really in the spinal fluid sample?

Visualizing taxonomic results

MMseqs2 can also generate an interactive visualization of the data using Krona (Ondov, Bergman, and Phillippy 2011). Adapt the previous call to generate a Krona report:

In [15]:
!mmseqs taxonomyreport uniprot_sprot lca_result --report-mode 1 report.html
Usage: mmseqs taxonomyreport <i:targetDB> <i:taxDB> <o:taxonomyReport> [options]

Create Kraken-style taxonomy report.
 By Florian Breitwieser <[email protected]>

Options: 
 Common:         
   --threads INT      number of cores used for the computation (uses all cores by default) [32]
   --compressed INT   write results in compressed format [0]
   -v INT             verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]
Unrecognized parameter --report-mode
Did you mean "--threads"?

This generates a HTML file that can be opened in a browser. This offers an interactive circular visualization where you can click on each label to zoom into different parts of the hierarchy.

Click to open report

What is the pathogen?

Look up the following encephalitis causing agents in Wikipedia.

  1. Borrelia bacterium

  2. Herpes simplex virus

  3. Powassan virus

  4. West Nile virus

  5. Mycoplasma

  6. Angiostrongylus cantonensis

  • Based on the literature, which one is the most likely pathogen?
  • For which species do you find evidence in the metagenomic reads?
  • Approximately how many reads belong to the pathogen? Based on this number, how would you determine if it is significant evidence for an actual presence of this agent?

Investigating the pathogen

We now want to take a closer look only at the reads of the pathogen. To filter the result database, we will need the pathogen's numeric taxonomic identifier. Use the NCBI Taxonomy Browser to find it, by searching for its name.

  • What is the taxon identifier of the pathogen? Did you find one or more?

Now we can call a different MMseqs2 module to retrieve only the reads that belong to this pathogen. Replace XXX with the number(s) you just found. If you found multiple, concatenate them with a comma character.

In [28]:
!mmseqs filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083
filtertaxdb uniprot_sprot lca_result lca_only_pathogen --taxon-list 11083 

MMseqs Version:	10.6d92c
Compressed     	0
Selected taxons	11083

Loading NCBI taxonomy
Loading nodes file ... Done, got 2192766 nodes
Loading merged file ... Done, added 55744 merged nodes.
Loading names file ... Done
Making matrix ... Done
Init RMQ ...Done
Computing LCA
[=================================================================] 100.00% 410.70K 0s 128ms    ===========================================>             ] 78.56% 322.66K eta 0s       

Time for merging into lca_only_pathogen by mergeResults: 0h 0m 0s 206ms
Time for processing: 0h 0m 7s 863ms

We now get a list of all queries that were filtered out, meaning they were annotated as pathogenic.

In [35]:
!grep -Pv '\t1$' lca_only_pathogen.index > pathogenic_read_ids

With a few more commands we can convert our taxonomic labels back into a FASTA file:

In [36]:
!mmseqs createsubdb pathogenic_read_ids reads reads_pathogen
reads_pathogen exists and will be overwritten.
createsubdb pathogenic_read_ids reads reads_pathogen 

MMseqs Version:	10.6d92c
Subdb mode	0
Verbosity 	3

Time for merging into reads_pathogen by mergeResults: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 36ms
In [37]:
!mmseqs convert2fasta reads_pathogen reads_pathogen.fasta
Database reads_pathogen need header information.
The reads_pathogen_h is missing.
  • How many reads of the pathogen are in this resulting FASTA file?

Assembling reads to proteins

We want to try to recover the protein sequences of the pathogen.

  • Which proteins do you expect to find in the pathogen you discovered? Search the internet.

We will use the protein assembly method Plass (Steinegger, Mirdita, and Söding 2019) to find overlapping reads and generate whole proteins out of the best matching ones.

In [38]:
!plass assemble reads_pathogen.fasta pathogen_assembly.fasta tmp
Program call:
assemble reads_pathogen.fasta pathogen_assembly.fasta tmp 

MMseqs Version:                                                           	2.c7e35
Sub Matrix                                                                	blosum62.out
Rescore mode                                                              	0
Remove hits by seq.id. and coverage                                       	false
E-value threshold                                                         	1e-05
Coverage threshold                                                        	0
Coverage Mode                                                             	0
Seq. Id Threshold                                                         	0.9
Seq. Id. Mode                                                             	0
Include identical Seq. Id.                                                	false
Sort results                                                              	0
In substitution scoring mode, performs global alignment along the diagonal	false
No preload                                                                	false
Threads                                                                   	32
Verbosity                                                                 	3
Alphabet size                                                             	13
Kmer per sequence                                                         	60
Mask Residues                                                             	0
K-mer size                                                                	14
Max. sequence length                                                      	65535
Shift hash                                                                	5
Split Memory Limit                                                        	0
Include only extendable                                                   	true
Skip sequence with n repeating k-mers                                     	8
Min codons in orf                                                         	45
Max codons in length                                                      	2147483647
Max orf gaps                                                              	2147483647
Contig start mode                                                         	2
Contig end mode                                                           	2
Orf start mode                                                            	0
Forward Frames                                                            	1,2,3
Reverse Frames                                                            	1,2,3
Translation Table                                                         	1
Use all table starts                                                      	false
Offset of numeric ids                                                     	0
Protein Filter Threshold                                                  	0.2
Filter Proteins                                                           	1
Number search iterations                                                  	12
Remove Temporary Files                                                    	false
Sets the MPI runner                                                       	

Program call:
createdb reads_pathogen.fasta /home/chick/work/MMseqs2/tutorial/tmp/8597800890735853730/nucl_reads --max-seq-len 65535 --dont-split-seq-by-len 0 --dont-shuffle 1 --id-offset 0 -v 3 

MMseqs Version:              	2.c7e35
Max. sequence length         	65535
Split Seq. by len            	false
Do not shuffle input database	true
Offset of numeric ids        	0
Verbosity                    	3

File reads_pathogen.fasta does not exist.
Error: createdb failed

Take a look at the generated FASTA file pathogen_assembly.fasta.

  • How many sequences were assembled?
  • Do some of the sequences look similar to each other?

Clustering to find representative proteins

Plass will uncover a lot of variation in the reads and output many similar proteins. We can use the sequence clustering module in MMseqs2 to get only representative sequences.

In [39]:
!mmseqs easy-cluster pathogen_assembly.fasta assembly_clustered tmp
Input pathogen_assembly.fasta does not exist.

You will see three files starting with assembly_clustered:

  1. assembly_clustered_all_seqs.fasta

  2. assembly_clustered_cluster.tsv

  3. assembly_clustered_rep_seq.fasta

Take a look at the last one assembly_clustered_rep_seq.fasta. This file contains all representative sequences, meaning the sequence that the algorithm chose as the most representative within this cluster.

  • How many sequences remain now?
  • How well does this agree with what you expected according to your internet search?

Annotating the proteins

We will look for known protein domains to identify the proteins we found. Instead of the MMseqs2 command line, we use the MMseqs2 webserver, which will visualize the results. Paste the content of the FASTA file containing the representative sequences into the webserver and make sure to select all three target databases (PFAM, PDB, Uniclust): https://search.mmseqs.com

  • Which of the expected proteins do you find?

Aftermath

Despite being able to identify the causative agent. The pathogen is very hard to treat. The patient had minimal neurological recovery and was discharged to an acute care facility on hospital day 30. Seven months after discharge, he was reportedly able to nod his head to questions and slightly move his upper extremities and toes.

You can find the publication about this patient and dataset here (Piantadosi et al. 2017). Please look at it only after trying to answer the questions yourself.

References

Consortium, UniProt. 2014. "UniProt: A Hub for Protein Information." Nucleic Acids Research 43 (D1): D204--D212.

Magoc, Tanja, and Steven L. Salzberg. 2011. "FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies." Bioinformatics 27 (21): 2957--63.

Ondov, Brian D, Nicholas H Bergman, and Adam M Phillippy. 2011. "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12 (1): 385.

Piantadosi, Anne, Sanjat Kanjilal, Vijay Ganesh, Arjun Khanna, Emily P Hyle, Jonathan Rosand, Tyler Bold, et al. 2017. "Rapid Detection of Powassan Virus in a Patient With Encephalitis by Metagenomic Sequencing." Clinical Infectious Diseases 66 (5): 789--92.

Steinegger, Martin, Milot Mirdita, and Johannes Söding. 2019. "Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold." Nature Methods 16 (7): 603--6.

Steinegger, Martin, and Johannes Soeding. 2017. "MMseqs2: Sensitive Protein Sequence Searching for the Analysis of Massive Data Sets." bioRxiv, 079681.

Wood, Derrick E, and Steven L Salzberg. 2014. "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biology 15 (3): R46.