v0.0.1
Imperial BRC Genomics Facility
Imperial BRC Genomics Facility
2020-April-21 14:42
Download the aligned corona virus genomes from Genexa
## Downloading aligned fasta format
!wget https://storage.googleapis.com/sars-cov-2/MSA_SARS2_20200329.fasta
--2020-04-02 18:49:21-- https://storage.googleapis.com/sars-cov-2/MSA_SARS2_20200329.fasta Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.129.128, 2607:f8b0:4001:c03::80 Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.129.128|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 4668949 (4.5M) [application/octet-stream] Saving to: 'MSA_SARS2_20200329.fasta' MSA_SARS2_20200329. 100%[===================>] 4.45M --.-KB/s in 0.06s 2020-04-02 18:49:21 (70.8 MB/s) - 'MSA_SARS2_20200329.fasta' saved [4668949/4668949]
## DOwnloading data in Clustal format
!wget -O MSA_SARS2_20200329.aln https://storage.googleapis.com/sars-cov-2/MSA_SARS2_20200329.clw
--2020-04-02 19:10:59-- https://storage.googleapis.com/sars-cov-2/MSA_SARS2_20200329.clw Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.214.128, 2607:f8b0:4001:c07::80 Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.214.128|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 5932931 (5.7M) [application/octet-stream] Saving to: 'MSA_SARS2_20200329.aln' MSA_SARS2_20200329. 100%[===================>] 5.66M --.-KB/s in 0.1s 2020-04-02 19:10:59 (57.8 MB/s) - 'MSA_SARS2_20200329.aln' saved [5932931/5932931]
!FastTree -nt -gtr < MSA_SARS2_20200329.fasta >MSA_SARS2_20200329.tree
FastTree Version 2.1.10 Double precision (No SSE3) Alignment: standard input Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000 Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1 TopHits: 1.00*sqrtN close=default refresh=0.80 ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories Ignored unknown character D (seen 1 times) Ignored unknown character H (seen 13 times) Ignored unknown character K (seen 1 times) Ignored unknown character R (seen 1 times) Ignored unknown character S (seen 3 times) Ignored unknown character V (seen 1 times) Ignored unknown character W (seen 3 times) Ignored unknown character X (seen 42 times) Ignored unknown character Y (seen 17 times) Initial topology in 2.67 seconds0 of 123 126 seqs (at seed 100) Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs Total branch-length 0.010 after 19.69 sec, 101 of 124 splits, 0 changes ax delta 0.000) WARNING! This alignment consists of closely-related and very-long sequences. WARNING! FastTree (or other standard maximum-likelihood tools) may not be appropriate for aligments of very closely-related sequences like this one, as FastTree does not account for recombination or gene conversion ML-NNI round 1: LogLk = -45578.360 NNIs 57 max delta 6.62 Time 27.53ges (max delta 6.617) GTR Frequencies: 0.2990 0.1837 0.1962 0.3211ep 12 of 12 GTR rates(ac ag at cg ct gt) 0.4703 1.4001 0.5361 0.6047 2.2160 1.0000 Switched to using 20 rate categories (CAT approximation)20 of 20 Rate categories were divided by 0.626 so that average rate = 1.0 CAT-based log-likelihoods may not be comparable across runs Use -gamma for approximate but comparable Gamma(20) log-likelihoods ML-NNI round 2: LogLk = -44402.405 NNIs 34 max delta 0.00 Time 89.79ges (max delta 0.000) Turning off heuristics for final round of ML NNIs (converged) ML-NNI round 3: LogLk = -44401.762 NNIs 27 max delta 0.55 Time 104.20 (final)delta 0.546) Optimize all lengths: LogLk = -44401.740 Time 107.77 Total time: 128.59 seconds Unique: 126/153 Bad splits: 0/123rnal splits
import os
import requests
import json
import time
from ete3 import Tree, TreeStyle,SeqMotifFace
from PIL import Image
os.environ['QT_QPA_PLATFORM']='offscreen'
t = Tree("MSA_SARS2_20200329.tree")
## plain tree
t.render("%%inline")
## circular tree
ts = TreeStyle()
ts.mode = "c"
#ts.scale = 20
t.render("%%inline",tree_style=ts)
## tree with branch length
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_length = True
ts.show_branch_support = True
t.render("%%inline",tree_style=ts)
## 180 deg circular tree
ts = TreeStyle()
ts.show_leaf_name = True
ts.mode = "c"
ts.arc_start = -180 # 0 degrees = 3 o'clock
ts.arc_span = 180
t.render("180_deg_circular_tree.png",tree_style=ts)
t.render("%%inline",tree_style=ts)
## read aligned fastq file
fasta_data = dict()
with open('MSA_SARS2_20200329.fasta','r') as fp:
header = ''
fasta_list = list()
for line in fp:
line = line.strip()
if line.startswith('>'):
if header != '':
fasta_data.update({header:''.join(fasta_list)})
header = line.split()[0].replace('>','')
fasta_list = list()
else:
fasta_list.append(line)
fasta_data.update({header:''.join(fasta_list)})
## plot tree with alignment
for seq_id,seq in fasta_data.items():
seqFace = SeqMotifFace(seq, gapcolor="red")
(t & "{0}".format(seq_id)).add_face(seqFace, 0, "aligned")
ts = TreeStyle()
ts.tree_width = 100
t.render("tree_with_aln.png", tree_style=ts);
Jaime Huerta-Cepas, Francois Serra and Peer Bork. Mol Biol Evol 2016; doi: 10.1093/molbev/msw046