blasr and (I)Python

I am not sure it is the right thing to do some coding on a vacation. (See
this tweet; I am not sure I would agree.)
Anyway, before the vacation, I decided to organize the whole bunch of random papers
I had collected on my laptop over the last few years. I eventually felt that I should
read some of the theoretical papers about genome assembly again. I grabbed
Gene Myers' paper,
and started to think about the problem of constructing unitigs
(high-confidence contigs). Before I tried to read the paper in detail, I just
felt it might be useful to write some quick code to check out what real
data looked like. I started writing some code to visualize the local overlaps
and to generate the global overlapping graph. It was pretty straightforward, and the
visualization was quite inspiring. It motivated me to write
more ad-hoc code in IPython during the vacation, going beyond simple visualization.
I started to implement a very simple greedy algorithm to connect the
input DNA fragments. Eventually, I found I could actually get the whole genome
assembly right!! This Notebook shows the work, step by step, toward a simple genome
assembler for long read data using IPython
and Python
.
The input data was the E. coli K12 data from the PacBio(R) DevNet Site. I generated a new set of pre-assembled reads as the input DNA fragments using HBAR-DTK. The statistics of the input reads are shown below:
Pre-assembled Read Stats
#Seqs 17497
Min 448
1st Qu. 5365
Median 6149
Mean 6193
3rd Qu. 7189
Max 14762
Total 108363627
n50 6521
n90 4694
n95 3864
This is about 23x coverage of the E. coli K12 genome with reads of 6.2 kb average length. The mean sequence identity when aligning the pre-assembled reads to the canonical E. coli K12 reference is 99.787% (median identity = 99.921%). The pre-assembled read data is available for download from this link to pre_assembled_read.fa. According to the Lander-Waterman statistics, this set of reads is enough to get a single contig for the single chromosome. The only thing one needs to worry about is whether the repeats in the genome can be uniquely resolved. Since it has been shown that one can use Celera Assembler to assemble this data set into one single contig, this implies that all of the repeats of the rRNA operons can be resolved from this data set. Typically, one might need multiple libraries, especially long jumping libraries, to resolve long repeats from short read (~100 bp) data. For example, ALLPATHS-LG can use multiple short-read libraries and long read data to get perfect assemblies for bacteria (Paper, ALLPATHS-LG Blog). However, when one can get long range information from one single library, the assembly process can probably be greatly simplified. As demonstrated here, one can actually get a pretty good assembly using Python code with blasr for overlapping and Quiver for polishing the residual errors.
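As a quick sanity check on the Lander-Waterman claim, here is a back-of-the-envelope calculation (my own sketch, not part of the original notebook; the ~4.65 Mb genome size is an approximation):

```python
import math

# Input read statistics from the table above
n_reads = 17497
mean_len = 6193
genome_size = 4650000.0  # approximate size of the E. coli K12 chromosome

# coverage c = N * L / G, about 23x as stated in the text
coverage = n_reads * mean_len / genome_size

# Lander-Waterman: the expected number of gaps (contig breaks)
# is approximately N * exp(-c), essentially zero at this coverage
expected_gaps = n_reads * math.exp(-coverage)
```

At 23x coverage the expected number of gaps is far below one, so a single contig for the chromosome is what the statistics predict, repeats aside.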
For long read data, the de Bruijn graph approach, with the sophisticated graph operations needed to get around the limits of short read data, may not be the right way to assemble genomes. The Overlap-Layout-Consensus approach can be much more efficient for assembling long reads. Also, as the reads get longer, the overlaps between the reads become more and more specific. This significantly reduces the complexity of the overlapping graph between the reads. (Here we define an overlapping graph as a graph where the nodes are reads and the edges are constructed from the pairs of properly overlapped reads.) When the overlapping graph is simple, it becomes possible to use a naive algorithm to get a reasonably good assembly. To make this clear, imagine the read lengths were as long as the genome: the number of overlaps one would need to worry about would be in the single digits and there would be no ambiguity at all. The overlapping and layout would be trivial in such an extreme case.
When I use Celera Assembler (CA)
to assemble genomes, I always try to peek at the
intermediate data in order to understand the process better and maybe fix a few
things manually. Michael Schatz (@mike_schatz)
and Adam Phillippy both pointed me to
some very useful information about CA. (Thanks, Michael and Adam.)
tigStore
is a very useful command for seeing the relationship between the
generated unitigs and the initial DNA fragments. The file
4-unitigger/best.edges
contains the information about the best overlapped reads, which
one can use to construct the overlapping graph, and one can learn a lot of subtle
points about genome assembly by visualizing those. This notebook is definitely
inspired by the great work done by the CA's developers.
This IPython notebook is just a "Notebook". I wrote this code for fun and for
(self-)educational purposes. It shows that, given good long read data and
the proper tools blasr
,
pbdagcon
,
HBAR-DTK
, and
quiver
, one can learn how to
do genome assembly with a "not-so-high-performance" language Python
. The great
IPython Notebook
allows me to try out different
code snippets and different algorithm ideas without a lot of overhead. One
should note that the code in this IPython Notebook
is mainly for fun, not for
"production". I chose the path of least resistance for coding. Coding style and
code performance were considered but not taken seriously, as I was doing a lot
of experiments with different algorithmic approaches when I coded this notebook.
I consider this Notebook a personal project for fun. The data and the software used can all be downloaded publicly. Although I am an employee of PacBio(R), this notebook was developed in my own time. PacBio will not be responsible for any information associated with this Notebook.
--Jason Chin, Apr 2, 2013
I put the pre-assembled read file on Figshare. Here is the link to the pre_assembled_read.fa. For most parts of the Notebook, one does not need the fasta file. You can also visit the project data page to see other data files and this code.
The first step in assembling the genome is to use blasr
to find the overlaps between the reads. This can be done by executing this command:
blasr pre_assembled_reads.fa pre_assembled_reads.fa -m 4 -nproc 32 -bestn 32 -maxLCPLength 15 -nCandidates 36 -out pr_pr_bn32.m4 -noSplitSubreads
We ask blasr
to output the coordinates of the alignments between the reads by specifying the -m 4
option.
The -nproc 32
option lets blasr
use 32 threads. For each read, we allow blasr
to find the 32 best overlaps by using -bestn 32
and scanning 36 potential hit candidates with -nCandidates 36
. The -noSplitSubreads
option has only a cosmetic effect here. It makes the output query identifiers cleaner.
blasr
is a general purpose aligner. It is not really tuned for overlapping work. One will sometimes see that it does not handle the ends of the alignments as one would like. The containment_tolerance
variable used below was introduced to be less picky about the ends of the alignments. However, one would need to worry about potential mis-assemblies if the tolerance is too big.
The alignment results can be found on the FigShare project page mentioned above, or one can download them from this link.
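To make the containment criterion concrete, here is a minimal sketch (my own illustration, not code from the notebook) of the test that get_overlap_data applies below: a read counts as contained when the alignment covers all but at most containment_tolerance bases of it.

```python
def is_contained(aln_start, aln_end, read_len, containment_tolerance=50):
    # A read is treated as contained when the unaligned portion
    # (the bases outside [aln_start, aln_end)) is below the tolerance.
    return read_len - (aln_end - aln_start) < containment_tolerance

# A 2899 bp read aligned end to end is contained...
assert is_contained(0, 2899, 2899)
# ...but a read with a long unaligned tail is not.
assert not is_contained(0, 2899, 5755)
```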
The "blasr -m 4" output format

The cells below show the content of the m4
file. It contains the alignment information between the reads. The fields are query identifier
, target identifier
, alignment score (more negative -> better alignment)
, alignment identity
, query strand
, query start
, query end
, query length
, target strand
, target start
, target end
, target length
. The rest of the fields are not used. One trap here is that the "target start
" and "target end
" are not relative to the original sequence if "target strand
" is 1
. When "target strand = 1
", it means that the reverse-complementary strand is aligned and the "target start
" and "target end
" are the coordinates on the reverse-complementary strand. In the standard blasr
output, "query strand
" is always 0
.
!head -10 pr_pr_bn32.m4
0xc4b462ba_0096978 0xc4b462ba_0096978 -14495 100 0 0 2899 2899 0 0 2899 2899 50 60847 -2716.48 0 21
0xc4b462ba_0096978 0x79c77458_0006261 -14400 99.4505 0 0 2899 2899 1 2502 5411 10399 50 60890 -2683.04 0 21
0xc4b462ba_0096978 0x72a8c4f1_0018952 -14395 99.4164 0 0 2899 2899 0 483 3393 5755 50 60893 -2658.33 0 21
0xc4b462ba_0096978 0x4dd7c9d5_0133333 -14390 99.4162 0 0 2899 2899 1 2718 5626 6023 50 60894 -2683.04 0 21
0xc4b462ba_0096978 0x726e8550_0094067 -14385 99.3821 0 0 2899 2899 0 27 2936 7163 50 60897 -2658.76 0 21
0xc4b462ba_0096978 0x833174a6_0078354 -14385 99.3821 0 0 2899 2899 1 2676 5585 6916 50 60897 -2683.28 0 21
0xc4b462ba_0096978 0xc52dd432_0103092 -14380 99.3819 0 0 2899 2899 1 2565 5472 6025 50 60900 -2666.7 0 21
0xc4b462ba_0096978 0x48e2fc97_0054635 -14375 99.3477 0 0 2899 2899 0 2621 5529 7539 50 60902 -2658.24 0 21
0xc4b462ba_0096978 0xea0a0bab_0128842 -14375 99.3477 0 0 2899 2899 0 7684 10592 11923 50 60901 -2657.5 0 21
0xc4b462ba_0096978 0x4949e7c9_0060870 -14360 99.3132 0 0 2899 2899 1 2927 5832 6621 50 60907 -2666.12 0 21
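To illustrate the coordinate trap described above, the small helper below (my own sketch, not part of the notebook) maps "target start"/"target end" back to forward-strand coordinates when "target strand" is 1. The example values come from the second m4 record shown above.

```python
def to_forward_strand(t_strand, t_start, t_end, t_len):
    # For strand-1 hits, blasr reports coordinates on the
    # reverse-complementary strand; flip them back to the forward strand.
    if t_strand == 1:
        return t_len - t_end, t_len - t_start
    return t_start, t_end

# Second record above: target strand 1, start 2502, end 5411, length 10399
assert to_forward_strand(1, 2502, 5411, 10399) == (4988, 7897)
# Strand-0 coordinates are already on the forward strand
assert to_forward_strand(0, 483, 3393, 5755) == (483, 3393)
```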
Converting blasr's output to python internal data

We define a read_node
class for the objects that hold the local overlap information of each read. It simply holds a python
dictionary (hash_map) where the key is the target identifier
and the elements from each m4
line, converted to the right types, are stored. Only non-contained alignments are stored: if one read is fully aligned within another read, such an alignment is not stored in a read_node
. The get_overlap_data
function processes the m4
file and determines whether the alignments are contained,
with some tolerance at the ends of the alignments.
import sys
import os

class read_node_0: #experimenting on keeping the data in a list rather than a dictionary, not used
    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.hits = []
        self.hit_map = None
    def add_hit(self, aln_info):
        self.hits.append( aln_info )
    def __getitem__(self, target_name):
        if self.hit_map is None:
            self.hit_map = dict(zip( [h[0] for h in self.hits], self.hits ) )
        return self.hit_map[target_name]

class read_node:
    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.hit_map = {}
    def add_hit(self, aln_info):
        self.hit_map[aln_info[0]] = aln_info
    def __getitem__(self, target_name):
        return self.hit_map[target_name]
    @property
    def hits(self): #another convenient representation of the same data
        return self.hit_map.values()

def get_overlap_data(m4_filename):
    containment_tolerance = 50
    permitted_error_pct = 2
    overlap_data = {}
    contained_reads = set()
    with open(m4_filename) as m4_f:
        for l in m4_f:
            l = l.strip().split()
            q_name, t_name = l[0:2]
            if q_name == t_name:
                continue
            aln_score = int(l[2])
            aln_idt = float(l[3])
            if aln_idt < 100 - permitted_error_pct:
                continue
            q_strand, q_start, q_end, q_len = ( int(x) for x in l[4:8] )
            t_strand, t_start, t_end, t_len = ( int(x) for x in l[8:12] )
            if q_len - (q_end - q_start) < containment_tolerance:
                contained_reads.add(q_name)
            if t_len - (t_end - t_start) < containment_tolerance:
                contained_reads.add(t_name)
            if q_name not in overlap_data:
                overlap_data[ q_name ] = read_node(q_name, q_len)
            assert q_strand == 0
            if t_name not in [x[0] for x in overlap_data[ q_name ].hits]:
                overlap_data[ q_name ].add_hit( (t_name,
                                                 aln_score,
                                                 aln_idt,
                                                 (q_strand, q_start, q_end, q_len),
                                                 (t_strand, t_start, t_end, t_len) ) )
            #symmetrize the alignment record
            if t_name not in overlap_data:
                overlap_data[ t_name ] = read_node(t_name, t_len)
            if q_name not in [x[0] for x in overlap_data[ t_name ].hits]:
                if t_strand == 1:
                    t_start, t_end = t_len - t_end, t_len - t_start
                    q_start, q_end = q_len - q_end, q_len - q_start
                    overlap_data[ t_name ].add_hit( (q_name,
                                                     aln_score,
                                                     aln_idt,
                                                     (0, t_start, t_end, t_len),
                                                     (1, q_start, q_end, q_len) ) )
                else:
                    overlap_data[ t_name ].add_hit( (q_name,
                                                     aln_score,
                                                     aln_idt,
                                                     (0, t_start, t_end, t_len),
                                                     (0, q_start, q_end, q_len) ) )
    return overlap_data, contained_reads
overlap_data, contained_reads = get_overlap_data("pr_pr_bn32.m4")
The generate_overlap_gml
function takes the overlap_data
and the contained_reads
to generate an overlapping graph in GML
format. (You will need to install networkx
in your Python
environment to use this code.)
def generate_overlap_gml(overlap_data, contained_reads, gml_filename):
    containment_tolerance = 50
    permitted_error_pct = 2
    import networkx as nx
    G = nx.DiGraph()
    node_in_graph = set()
    for q_name in overlap_data:
        if q_name in contained_reads:
            continue
        if q_name not in node_in_graph:
            G.add_node(q_name)
        targets = overlap_data[ q_name ].hits
        targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and h[0] not in contained_reads]
        targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and h[0] not in contained_reads]
        targets_3prime.sort(key = lambda k: k[1])
        targets_5prime.sort(key = lambda k: k[1])
        if len(targets_3prime) > 0:
            t = targets_3prime[0]
            t_name = t[0]
            if t_name not in node_in_graph:
                G.add_node(t_name)
            G.add_edge(q_name, t_name)
        if len(targets_5prime) > 0:
            t = targets_5prime[0]
            t_name = t[0]
            if t_name not in node_in_graph:
                G.add_node(t_name)
            G.add_edge(q_name, t_name)
    nx.write_gml(G, gml_filename)

generate_overlap_gml(overlap_data, contained_reads, "overlap_graph.gml")
The generated overlap_graph.gml can be visualized with Gephi. One can use the "Yifan Hu's Multilevel" graph layout algorithm to do the rough layout, followed by the "ForceAtlas 2" algorithm to smooth the graph. The cell below shows the global overlapping structure. One can see that it is topologically a circle plus a number of singletons. Namely, the read data does capture the whole genome in one single circular layout.
from IPython.display import Image
Image(filename = "overlap_graph01.png")
We can also see some local "bubbles" in the graph.
Image(filename = "overlap_graph02.png")
Image(filename = "overlap_graph03.png")
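The "circle" topology can also be checked without a graph viewer. The toy sketch below (my own simplification, not code from the notebook) tests whether a set of directed best-overlap edges forms one single cycle, under the simplifying assumption that each read keeps only one outgoing best edge:

```python
def is_single_cycle(edges):
    # edges: list of (read, best_overlap_partner) pairs.
    succ = {}
    for u, v in edges:
        if u in succ:       # more than one outgoing edge: not a simple cycle
            return False
        succ[u] = v
    nodes = set(succ) | set(succ.values())
    if set(succ) != nodes:  # some node has no outgoing edge
        return False
    # Walk the successors from an arbitrary start;
    # a single cycle visits every node and returns to the start.
    start = next(iter(nodes))
    seen = set()
    cur = start
    while cur not in seen:
        seen.add(cur)
        cur = succ[cur]
    return cur == start and seen == nodes

assert is_single_cycle([("r1", "r2"), ("r2", "r3"), ("r3", "r1")])
assert not is_single_cycle([("r1", "r2"), ("r2", "r3")])  # a linear path
```

A clean circular genome with no repeats would pass this test; the real best-overlap graph has two candidate edges per read (one per end), so the check above is only a cartoon of what Gephi shows.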
SVG files

The cell below defines some constants and utility functions so we can generate SVG
to show the local overlapping alignments.
arrow_defs = """<defs>
  <marker id="Triangle_green" stroke="green" fill="green"
    viewBox="0 0 10 10" refX="0" refY="5"
    markerUnits="strokeWidth"
    markerWidth="4" markerHeight="3"
    orient="auto">
    <path d="M 0 0 L 10 5 L 0 10 z" />
  </marker>
  <marker id="Triangle_blue" stroke="blue" fill="blue"
    viewBox="0 0 10 10" refX="0" refY="5"
    markerUnits="strokeWidth"
    markerWidth="4" markerHeight="3"
    orient="auto">
    <path d="M 0 0 L 10 5 L 0 10 z" />
  </marker>
  <marker id="Triangle_black" stroke="black" fill="black"
    viewBox="0 0 10 10" refX="0" refY="5"
    markerUnits="strokeWidth"
    markerWidth="4" markerHeight="3"
    orient="auto">
    <path d="M 0 0 L 10 5 L 0 10 z" />
  </marker>
  <marker id="Triangle_red" stroke="red" fill="red"
    viewBox="0 0 10 10" refX="0" refY="5"
    markerUnits="strokeWidth"
    markerWidth="4" markerHeight="3"
    orient="auto">
    <path d="M 0 0 L 10 5 L 0 10 z" />
  </marker>
</defs>"""

def svg_arrow( x1, y1, x2, y2, col, w):
    return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f" marker-end="url(#Triangle_%s)"/> </g>""" % (col, w, x1, y1, x2, y2, col)

def svg_line( x1, y1, x2, y2, col, w):
    return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f"/> </g>""" % (col, w, x1, y1, x2, y2)

class Overlap_SVG_view:
    def __init__(self):
        self.header = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" version="1.1" width="900px" height="300px">
"""
        self.footer = """</svg>"""
        self.elements = []
        self._svg_data = None
    def add_element(self, elm):
        self.elements.append(elm)
    def _construct_SVG(self):
        SVG_str = []
        SVG_str.append( self.header )
        for elm in self.elements:
            SVG_str.append( elm )
        SVG_str.append( self.footer )
        self._svg_data = "\n".join(SVG_str)
        self._svg_data = self._svg_data.decode('utf-8')
    def output(self, filename):
        if self._svg_data is None:
            self._construct_SVG()
        with open(filename, "w") as out_f:
            print >>out_f, self._svg_data
    def _repr_svg_(self):
        if self._svg_data is None:
            self._construct_SVG()
        return self._svg_data
    @property
    def svg(self):
        return SVG(self._repr_svg_())
generate_overlap_view: a function taking the overlapping data to generate an SVG file

The generate_overlap_view
function simply does several coordinate transformations to create an SVG
file so we can see the local overlaps around a given DNA fragment. The black arrow indicates the fragment of interest. Green and blue arrows indicate the forward and reverse proper overlapped alignments. The red arrows indicate containment alignments or improper alignments (locally inconsistent alignments due to repeats, mis-mapping or sequencing artifacts).
from IPython.display import SVG
import os

def generate_overlap_view(overlap_data, q_name):
    containment_tolerance = 10
    hits = overlap_data[q_name].hits
    SV = Overlap_SVG_view()
    SV.add_element(arrow_defs)
    x_offset = 10000
    x_scaling = 0.02
    x1 = 0
    x2 = overlap_data[q_name].length
    x1 += x_offset
    x2 += x_offset
    x1 *= x_scaling
    x2 *= x_scaling
    y = 200
    SV.add_element(svg_arrow(x1, y, x2, y, "black", 4))
    hits.sort(key = lambda k: (k[3][1], k[3][3], k[3][2]))
    for h in hits:
        t_name, t_score, t_idt = h[0], h[1], h[2]
        if t_name in contained_reads:
            continue
        q_strand, q_start, q_end, q_len = h[3]
        t_strand, t_start, t_end, t_len = h[4]
        y -= 10
        x1 = q_start
        x2 = q_end
        col = None
        if q_start < containment_tolerance and t_len - t_end < containment_tolerance:
            x1 = -t_start
        elif q_len - q_end < containment_tolerance and t_start < containment_tolerance:
            x2 = q_end + (t_len - t_end)
        else:
            col = "red"
            if t_strand == 1:
                x1, x2 = x2, x1
            print h
        if t_strand == 0 and col is None:
            col = "green"
        elif col is None:
            col = "blue"
            x1, x2 = x2, x1
        #print h
        #print min(x1, x2), max(x1, x2), abs(x1-x2)
        x1 += x_offset
        x2 += x_offset
        x1 *= x_scaling
        x2 *= x_scaling
        SV.add_element(svg_arrow(x1, y, x2, y, col, 3))
    return SV

q_name = overlap_data.keys()[0]
OSV = generate_overlap_view(overlap_data, q_name)
display(OSV.svg)
('0x9ba164ba_0123315', -32790, 99.8631, (0, 0, 6571, 6571), (0, 38, 6610, 7974))
Given that we see a beautiful circular overlapping graph, we would like to see if we can choose a seeding fragment and extend the overlaps to form contigs.
The code below scans through the overlap_data
and assigns the best-scoring alignments for both the 3' and 5' ends of a read as the best partners. The return value of get_best_overlap_graph
is a dictionary where the key is a fragment identifier and the value holds the alignment information of the 3' and 5' best overlapping alignments.
def filter_overlap_hit(overlap_hit):
    containment_tolerance = 50
    q_strand, q_start, q_end, q_len = overlap_hit[3]
    t_strand, t_start, t_end, t_len = overlap_hit[4]
    if q_len - (q_end - q_start) < containment_tolerance and t_len - (t_end - t_start) < containment_tolerance:
        return False
    elif overlap_hit[0] in contained_reads:
        return False
    else:
        return True

def get_best_overlap_graph(overlap_data):
    best_overlap_graph = {}
    containment_tolerance = 50
    for q_name in overlap_data:
        if q_name in contained_reads:
            continue
        if q_name not in best_overlap_graph:
            best_overlap_graph[q_name] = {"5p":None, "3p":None}
        targets = overlap_data[ q_name ].hits
        targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and filter_overlap_hit(h)]
        targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and filter_overlap_hit(h)]
        targets_5prime.sort(key = lambda k: k[1])
        targets_3prime.sort(key = lambda k: k[1])
        if len(targets_5prime) > 0:
            best_overlap_graph[q_name]["5p"] = targets_5prime[0]
            t_name = targets_5prime[0][0]
        if len(targets_3prime) > 0:
            best_overlap_graph[q_name]["3p"] = targets_3prime[0]
            t_name = targets_3prime[0][0]
    return best_overlap_graph
get_best_overlap_graph

The cell below finds the best alignments for the read that is the first key in the best_overlap_graph
. It shows its best 3' and 5' partners.
best_overlap_graph = get_best_overlap_graph(overlap_data)
q_name = best_overlap_graph.keys()[0]
print q_name, "has best partners:", best_overlap_graph[q_name]
0xf6bb6a22_0085668 has best partners: {'3p': ('0xdcf35477_0042197', -37729, 99.9206, (0, 1032, 8586, 8586), (1, 0, 7557, 9323)), '5p': ('0xd9132340_0068296', -40160, 99.373, (0, 0, 8128, 8586), (1, 436, 8525, 8525))}
The two functions defined below allow us to extend the overlaps through the chain of overlapped fragments. The extension is terminated when (1) some inconsistent overlap is detected, (2) the overlapped fragment is already used by another contig, (3) no overlapped fragment is found, or (4) a circle is detected in the path. Such rules may not be the best way to generate unitigs, but they are very simple to implement. At least, these rules do work well for this E. coli data set. One can imagine more complicated scenarios in which it will be necessary to refine these rules or add more rules to get the best results and avoid mis-assemblies.
def proper_overlap_hit(overlap_hit):
    containment_tolerance = 50
    q_strand, q_start, q_end, q_len = overlap_hit[3]
    t_strand, t_start, t_end, t_len = overlap_hit[4]
    if q_start < containment_tolerance:
        if t_len - t_end > containment_tolerance:
            return False
        else:
            return True
    if t_start < containment_tolerance:
        if q_len - q_end > containment_tolerance:
            return False
        else:
            return True

def find_path( q_name_end, frag_used = set() ):
    reverse_end = {"3p":"5p", "5p":"3p"}
    path = []
    path_q_name = set()
    q_name, end = q_name_end
    if end == "5p":
        reversing_path = True
    else:
        reversing_path = False
    while 1:
        if q_name in frag_used:
            if reversing_path:
                path.reverse()
            return path, "frag_used"
        path.append( (q_name, end) )
        path_q_name.add(q_name)
        if q_name not in best_overlap_graph:
            if reversing_path:
                path.reverse()
            return path, "terminate_1"
        next_hit = best_overlap_graph[q_name][end]
        #print next_hit
        if next_hit is None:
            if reversing_path:
                path.reverse()
            return path, "terminate_2"
        if next_hit[0] in best_overlap_graph: #if not mutual good hits, break the path
            # Using mutual best hits might be too stringent:
            #bh = []
            #if best_overlap_graph[next_hit[0]]["5p"]:
            #    bh.append( best_overlap_graph[next_hit[0]]["5p"][0] )
            #if best_overlap_graph[next_hit[0]]["3p"]:
            #    bh.append( best_overlap_graph[next_hit[0]]["3p"][0] )
            bh = [h[0] for h in overlap_data[next_hit[0]].hits if proper_overlap_hit(h)]
            if q_name not in bh:
                if reversing_path:
                    path.reverse()
                return path, "branch"
        q_name = next_hit[0]
        if q_name in path_q_name:
            if reversing_path:
                path.reverse()
            return path, "circle"
        if next_hit[4][0] == 1: #reversed strand
            end = reverse_end[end]
Here we show how to call the find_path
function to get the overlapping paths toward both the 5' and 3' ends from a seed fragment. We pick the longest fragment in the data set as the seed. One will notice that the total number of fragments in the 5'-path and 3'-path combined is greater than the total number of fragments in the best_overlap_graph
. This means there is significant overlap between the 5'-path and the 3'-path, which is consistent with the genome being circular.
len_qname = [ (overlap_data[x].length, x) for x in best_overlap_graph.keys() ]
len_qname.sort()
q_name = len_qname[-1][1]
#q_name = "0xa4919d66_0061381"
print "The longest fragment is", q_name
path_5p, s_5p = find_path( (q_name, "5p") )
path_3p, s_3p = find_path( (q_name, "3p") )
print "The number of fragments in the 5' path is", len(path_5p)
print "The number of fragments in the 3' path is", len(path_3p)
print "The total number of fragments in both the 3' and 5' paths is %d." % (len(path_3p)+len(path_5p)-1)
print "%d is greater than the total number of fragments %d in the best_overlap_graph." % (len(path_3p)+len(path_5p)-1, len(best_overlap_graph))
print "The beginning of the 5'-path is", path_5p[0]
print "The end of the 3'-path is", path_3p[-1]

The longest fragment is 0x4eaa5fa7_0001083
The number of fragments in the 5' path is 2221
The number of fragments in the 3' path is 2158
The total number of fragments in both the 3' and 5' paths is 4378.
4378 is greater than the total number of fragments 3390 in the best_overlap_graph.
The beginning of the 5'-path is ('0xa4919d66_0061381', '3p')
The end of the 3'-path is ('0x24a73353_0082329', '3p')
Since the paths seem to cover the whole genome, we can use them to construct draft contigs. In order to do that, we need the fragment sequences.
seq_db = {}
with open("pre_assembled_reads.fa") as seq_file:
    for l in seq_file:
        l = l.strip()
        if l[0] == ">":
            name = l[1:]
            continue
        else:
            seq_db[name] = l
The layout_path
function goes through a layout path and fills in the bases from the fragment sequences. Nothing is really special here. However, it is always tricky to track the right orientations of the fragments. The function also tracks the fragments used in the layout. Each fragment can only be assigned to one layout/contig.
def rc_seq(seq):
    rev_map = dict(zip("acgtACGTNn-", "tgcaTGCANn-"))
    return "".join([rev_map[c] for c in seq[::-1]])

def layout_path(full_path, frag_used, out_fn, tig_name):
    if len(full_path) == 0 or full_path[0][0] in frag_used:
        return None
    if len(full_path) == 1:
        with open("singleton_" + out_fn, "w") as out_seq_file:
            seq = seq_db[full_path[0][0]]
            print >>out_seq_file, ">%s" % ( "singleton_" + tig_name )
            print >>out_seq_file, seq
        frag_used.add(full_path[0][0])
        return None
    first = full_path[0]
    offset = 0
    current_orientation = first[1]
    reverse_orientation = {"5p":"3p", "3p":"5p"}
    tig_seq = []
    frag_in_layout = set()
    frag_in_layout.add(first[0])
    for second in full_path[1:]:
        overlap_hit = overlap_data[first[0]][second[0]]
        q_strand, q_start, q_end, q_len = overlap_hit[3]
        t_strand, t_start, t_end, t_len = overlap_hit[4]
        #print overlap_hit, offset, current_orientation
        seq = seq_db[first[0]]
        if current_orientation == "3p":
            seq = rc_seq(seq)
        del tig_seq[offset:-1]
        tig_seq.extend(seq)
        if current_orientation == "5p":
            offset += q_start
        elif current_orientation == "3p":
            offset += q_len - q_end
        if t_strand == 1:
            current_orientation = reverse_orientation[current_orientation]
        frag_in_layout.add(first[0])
        first = second
        if second[0] in frag_in_layout:
            break
        if second[0] in frag_used:
            break
    seq = seq_db[first[0]]
    if current_orientation == "3p":
        seq = rc_seq(seq)
    del tig_seq[offset:-1]
    tig_seq.extend(seq)
    frag_in_layout.add(first[0])
    frag_used.update( frag_in_layout )
    tig_seq = tig_seq[0:offset+len(seq)]
    tig_seq = "".join(tig_seq)
    with open(out_fn, "w") as out_seq_file:
        print >>out_seq_file, ">%s" % tig_name
        print >>out_seq_file, tig_seq
Ok, ready for prime time!!
The code below picks the longest fragment as the first seed and generates a contig from its overlapping path. Then we pick the longest fragment among those reads that have not yet been used in a contig. This process repeats until no more fragments are available.
For this data set, we generate one big contig (4650011 bp) and 49 singleton reads.
frag_used = set()
all_best_overlap_frags = set(best_overlap_graph.keys())
unused_frag = all_best_overlap_frags - frag_used
i = 0
while len(unused_frag) != 0:
    len_qname = [ (overlap_data[x].length, x) for x in unused_frag ]
    len_qname.sort()
    q_name = len_qname[-1][1]
    print "iteration: ", i
    print "frag is used?", q_name in frag_used
    if q_name in frag_used:
        continue
    path_5p, s_5p = find_path( (q_name, "5p"), frag_used )
    path_3p, s_3p = find_path( (q_name, "3p"), frag_used )
    print "seeding frag:", q_name, "Length:", overlap_data[q_name].length
    print "number of unused frag:", len(unused_frag), "total overlapped frag:", len(path_3p)+len(path_5p)
    print len(path_5p), s_5p, len(path_3p), s_3p
    print "--------------------------"
    #print path_5p[0], path_3p[-1]
    if len(path_5p) + len(path_3p) == 0:
        out_fn = "tig_%05d.fa" % i
        tig_name = "tig%05d" % i
        with open("singleton_" + out_fn, "w") as out_seq_file:
            seq = seq_db[q_name]
            print >>out_seq_file, ">%s" % ( "singleton_" + tig_name )
            print >>out_seq_file, seq
        frag_used.add(q_name)
        unused_frag = all_best_overlap_frags - frag_used
        i += 1
        continue
    if len(path_5p) > 0:
        #assert path_5p[-1] == path_3p[0]
        full_path = path_5p + path_3p[1:]
    else:
        full_path = path_3p
    layout_path(full_path, frag_used, "tig_%05d.fa" % i, "tig%05d" % i)
    unused_frag = all_best_overlap_frags - frag_used
    i += 1
iteration: 0 frag is used? False seeding frag: 0x4eaa5fa7_0001083 Length: 14762 number of unused frag: 3390 total overlapped frag: 4379 2221 terminate_2 2158 terminate_2 --------------------------
iteration: 1 frag is used? False seeding frag: 0x720a6867_0101011 Length: 9562 number of unused frag: 64 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 2 frag is used? False seeding frag: 0xbb3ff171_0040732 Length: 8653 number of unused frag: 63 total overlapped frag: 2 1 branch 1 branch --------------------------
iteration: 3 frag is used? False seeding frag: 0xf9f76d57_0075336 Length: 8601 number of unused frag: 62 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 4 frag is used? False seeding frag: 0x24a73353_0082329 Length: 8529 number of unused frag: 61 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 5 frag is used? False seeding frag: 0x99a801d6_0119226 Length: 8390 number of unused frag: 60 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 6 frag is used? False seeding frag: 0x3176999f_0092788 Length: 8012 number of unused frag: 59 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 7 frag is used? False seeding frag: 0xd1f5cb9c_0041132 Length: 7742 number of unused frag: 58 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 8 frag is used? False seeding frag: 0x6a67d4d1_0054113 Length: 7635 number of unused frag: 57 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 9 frag is used? False seeding frag: 0x5e62b551_0087810 Length: 7612 number of unused frag: 56 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 10 frag is used? False seeding frag: 0x47c817fa_0037671 Length: 7592 number of unused frag: 55 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 11 frag is used? False seeding frag: 0xd34bbcb8_0096026 Length: 7586 number of unused frag: 54 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 12 frag is used? False seeding frag: 0xde53c97d_0130793 Length: 7568 number of unused frag: 53 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 13 frag is used? False seeding frag: 0xbab42971_0019685 Length: 7462 number of unused frag: 52 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 14 frag is used? False seeding frag: 0x95dcccbf_0021944 Length: 7446 number of unused frag: 51 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 15 frag is used? False seeding frag: 0x7b49a930_0025802 Length: 7402 number of unused frag: 50 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 16 frag is used? False seeding frag: 0xec2de5ef_0014019 Length: 7335 number of unused frag: 49 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 17 frag is used? False seeding frag: 0xb48bdbdf_0062004 Length: 7033 number of unused frag: 48 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 18 frag is used? False seeding frag: 0x201a3501_0131298 Length: 6986 number of unused frag: 47 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 19 frag is used? False seeding frag: 0xa3d2663c_0086930 Length: 6961 number of unused frag: 46 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 20 frag is used? False seeding frag: 0xe092bfc0_0025000 Length: 6941 number of unused frag: 45 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 21 frag is used? False seeding frag: 0x101cf700_0133174 Length: 6913 number of unused frag: 44 total overlapped frag: 2 1 terminate_2 1 terminate_2 --------------------------
iteration: 22 frag is used? False seeding frag: 0x27b1b79d_0098081 Length: 6687 number of unused frag: 43 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 23 frag is used? False seeding frag: 0x746a0885_0091093 Length: 6663 number of unused frag: 42 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 24 frag is used? False seeding frag: 0xf16e1447_0062854 Length: 6502 number of unused frag: 41 total overlapped frag: 2 1 terminate_2 1 terminate_2 --------------------------
iteration: 25 frag is used? False seeding frag: 0x2e1dde37_0073961 Length: 6444 number of unused frag: 40 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 26 frag is used? False seeding frag: 0x39af8897_0112579 Length: 6439 number of unused frag: 39 total overlapped frag: 2 1 terminate_2 1 terminate_2 --------------------------
iteration: 27 frag is used? False seeding frag: 0x4252156_0076315 Length: 6326 number of unused frag: 38 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 28 frag is used? False seeding frag: 0x8b32b9de_0082459 Length: 6255 number of unused frag: 37 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 29 frag is used? False seeding frag: 0xc0298763_0082661 Length: 6235 number of unused frag: 36 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 30 frag is used? False seeding frag: 0xfe2ddc8f_0006613 Length: 6219 number of unused frag: 35 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 31 frag is used? False seeding frag: 0xf2c9d7d9_0019647 Length: 6209 number of unused frag: 34 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 32 frag is used? False seeding frag: 0x5f497891_0044312 Length: 6178 number of unused frag: 33 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 33 frag is used? False seeding frag: 0x9d1fd3d5_0003951 Length: 6142 number of unused frag: 32 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 34 frag is used? False seeding frag: 0xd3405c1f_0120478 Length: 6056 number of unused frag: 31 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 35 frag is used? False seeding frag: 0xd84402bc_0135092 Length: 6022 number of unused frag: 30 total overlapped frag: 2 1 terminate_2 1 terminate_2 --------------------------
iteration: 36 frag is used? False seeding frag: 0x8e9f3eb7_0077492 Length: 6020 number of unused frag: 29 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 37 frag is used? False seeding frag: 0x6560473d_0129910 Length: 6009 number of unused frag: 28 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 38 frag is used? False seeding frag: 0x8e21ba01_0028034 Length: 5945 number of unused frag: 27 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 39 frag is used? False seeding frag: 0x328d5dfc_0076264 Length: 5867 number of unused frag: 26 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 40 frag is used? False seeding frag: 0x8c4c9750_0115911 Length: 5854 number of unused frag: 25 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 41 frag is used? False seeding frag: 0x99d966a0_0043182 Length: 5848 number of unused frag: 24 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 42 frag is used? False seeding frag: 0xebaae05c_0001651 Length: 5718 number of unused frag: 23 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 43 frag is used? False seeding frag: 0x6015c8f3_0137444 Length: 5607 number of unused frag: 22 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 44 frag is used? False seeding frag: 0x820c3016_0097966 Length: 5594 number of unused frag: 21 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 45 frag is used? False seeding frag: 0xfb73db8e_0019839 Length: 5523 number of unused frag: 20 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 46 frag is used? False seeding frag: 0x5842b8b1_0141185 Length: 5483 number of unused frag: 19 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
iteration: 47 frag is used? False seeding frag: 0x2876a699_0009758 Length: 5403 number of unused frag: 18 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 48 frag is used? False seeding frag: 0xf936c2fd_0094203 Length: 5372 number of unused frag: 17 total overlapped frag: 2 1 frag_used 1 terminate_2 --------------------------
iteration: 49 frag is used? False seeding frag: 0xe5322dba_0130928 Length: 5027 number of unused frag: 16 total overlapped frag: 2 1 frag_used 1 frag_used --------------------------
iteration: 50 frag is used?
False seeding frag: 0x2ac1140b_0130957 Length: 4894 number of unused frag: 15 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 51 frag is used? False seeding frag: 0x53d92d4b_0116355 Length: 4758 number of unused frag: 14 total overlapped frag: 2 1 frag_used 1 frag_used -------------------------- iteration: 52 frag is used? False seeding frag: 0xe09961a7_0000768 Length: 4746 number of unused frag: 13 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 53 frag is used? False seeding frag: 0x6bdaa7bf_0024191 Length: 4530 number of unused frag: 12 total overlapped frag: 2 1 terminate_2 1 terminate_2 -------------------------- iteration: 54 frag is used? False seeding frag: 0x2d6f1fdb_0064584 Length: 4419 number of unused frag: 11 total overlapped frag: 2 1 frag_used 1 terminate_2 -------------------------- iteration: 55 frag is used? False seeding frag: 0xe6c70a1b_0009216 Length: 4386 number of unused frag: 10 total overlapped frag: 2 1 frag_used 1 frag_used -------------------------- iteration: 56 frag is used? False seeding frag: 0x1bed69cc_0052813 Length: 4332 number of unused frag: 9 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 57 frag is used? False seeding frag: 0xc48ec0c3_0089324 Length: 4173 number of unused frag: 8 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 58 frag is used? False seeding frag: 0x2afb8806_0097596 Length: 3840 number of unused frag: 7 total overlapped frag: 2 1 terminate_2 1 terminate_2 -------------------------- iteration: 59 frag is used? False seeding frag: 0xf953ded0_0012908 Length: 2975 number of unused frag: 6 total overlapped frag: 2 1 frag_used 1 terminate_2 -------------------------- iteration: 60 frag is used? 
False seeding frag: 0x3b256f1f_0102519 Length: 2397 number of unused frag: 5 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 61 frag is used? False seeding frag: 0x3f31dab0_0019006 Length: 2040 number of unused frag: 4 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 62 frag is used? False seeding frag: 0xcc88cea2_0015902 Length: 1651 number of unused frag: 3 total overlapped frag: 2 1 terminate_2 1 frag_used -------------------------- iteration: 63 frag is used? False seeding frag: 0x2708c7d8_0113402 Length: 1617 number of unused frag: 2 total overlapped frag: 2 1 terminate_2 1 terminate_2 -------------------------- iteration: 64 frag is used? False seeding frag: 0x612f720b_0078835 Length: 1275 number of unused frag: 1 total overlapped frag: 2 1 terminate_2 1 frag_used --------------------------
One can align the contigs to the canonical E. coli K12 reference to evaluate the quality of the assembly. From the alignment data below, it appears that some fragments have accuracy too low to pass the 98% identity threshold used to call an "overlap" with other fragments. Three of them do not have full alignments. In those cases, one would need to examine the raw reads corresponding to these three fragments to see whether they carry minor variants or whether some lower-quality reads were not filtered out.
!cat singleton_tig_000* > singletons.fa
!blasr singletons.fa ecoli_k12_MG1655.fasta -m 4 -noSplitSubreads -nCandidates 20 -bestn 1 | awk '{print $1" "$2" "$4" "$6" "$7" "$8" "$8-($7-$6)}'
singleton_tig00001 Escherichia_coli_K12 97.4017 0 9562 9562 0 singleton_tig00002 Escherichia_coli_K12 97.3956 1431 8653 8653 1431 singleton_tig00003 Escherichia_coli_K12 98.432 0 8601 8601 0 singleton_tig00004 Escherichia_coli_K12 98.3834 0 8529 8529 0 singleton_tig00005 Escherichia_coli_K12 99.4192 0 8390 8390 0 singleton_tig00006 Escherichia_coli_K12 97.3447 0 8012 8012 0 singleton_tig00007 Escherichia_coli_K12 98.1237 0 7742 7742 0 singleton_tig00008 Escherichia_coli_K12 97.7821 0 7635 7635 0 singleton_tig00009 Escherichia_coli_K12 98.2443 0 7612 7612 0 singleton_tig00010 Escherichia_coli_K12 99.1766 2 7592 7592 2 singleton_tig00011 Escherichia_coli_K12 97.1927 0 7586 7586 0 singleton_tig00012 Escherichia_coli_K12 96.5121 0 7558 7568 10 singleton_tig00013 Escherichia_coli_K12 97.4269 0 7462 7462 0 singleton_tig00014 Escherichia_coli_K12 98.3228 0 7446 7446 0 singleton_tig00015 Escherichia_coli_K12 99.4492 0 7402 7402 0 singleton_tig00016 Escherichia_coli_K12 98.9734 0 7335 7335 0 singleton_tig00017 Escherichia_coli_K12 97.9114 0 7033 7033 0 singleton_tig00018 Escherichia_coli_K12 97.6773 0 6986 6986 0 singleton_tig00019 Escherichia_coli_K12 96.437 0 6961 6961 0 singleton_tig00020 Escherichia_coli_K12 97.4028 0 6941 6941 0 singleton_tig00021 Escherichia_coli_K12 97.5844 0 6913 6913 0 singleton_tig00022 Escherichia_coli_K12 99.1106 0 6687 6687 0 singleton_tig00023 Escherichia_coli_K12 98.9452 0 6663 6663 0 singleton_tig00024 Escherichia_coli_K12 97.4351 0 6502 6502 0 singleton_tig00025 Escherichia_coli_K12 97.3736 0 6444 6444 0 singleton_tig00026 Escherichia_coli_K12 96.5362 0 6439 6439 0 singleton_tig00027 Escherichia_coli_K12 97.9105 0 6326 6326 0 singleton_tig00028 Escherichia_coli_K12 97.7177 0 6255 6255 0 singleton_tig00029 Escherichia_coli_K12 98.0028 0 6235 6235 0 singleton_tig00030 Escherichia_coli_K12 98.09 0 6219 6219 0 singleton_tig00031 Escherichia_coli_K12 95.9777 0 6209 6209 0 singleton_tig00032 Escherichia_coli_K12 97.8922 0 6178 6178 0 
singleton_tig00033 Escherichia_coli_K12 97.8164 0 6142 6142 0 singleton_tig00034 Escherichia_coli_K12 99.2136 0 6056 6056 0 singleton_tig00035 Escherichia_coli_K12 97.2532 0 6022 6022 0 singleton_tig00036 Escherichia_coli_K12 97.854 0 6020 6020 0 singleton_tig00037 Escherichia_coli_K12 97.1526 0 6009 6009 0 singleton_tig00038 Escherichia_coli_K12 96.5394 0 5945 5945 0 singleton_tig00039 Escherichia_coli_K12 97.2628 0 5867 5867 0 singleton_tig00040 Escherichia_coli_K12 97.3697 0 5854 5854 0 singleton_tig00041 Escherichia_coli_K12 97.9883 0 5848 5848 0 singleton_tig00042 Escherichia_coli_K12 97.7751 0 5718 5718 0 singleton_tig00043 Escherichia_coli_K12 97.9378 0 5607 5607 0 singleton_tig00044 Escherichia_coli_K12 99.2899 0 5594 5594 0 singleton_tig00045 Escherichia_coli_K12 98.4484 0 5523 5523 0 singleton_tig00046 Escherichia_coli_K12 97.7524 0 5483 5483 0 singleton_tig00047 Escherichia_coli_K12 83.9108 0 5403 5403 0 singleton_tig00048 Escherichia_coli_K12 97.5472 0 5372 5372 0 singleton_tig00049 Escherichia_coli_K12 97.8931 0 5027 5027 0 singleton_tig00050 Escherichia_coli_K12 97.9968 0 4894 4894 0 singleton_tig00051 Escherichia_coli_K12 97.739 0 4758 4758 0 singleton_tig00052 Escherichia_coli_K12 94.8541 58 4746 4746 58 singleton_tig00053 Escherichia_coli_K12 97.3543 0 4530 4530 0 singleton_tig00054 Escherichia_coli_K12 95.1304 0 4419 4419 0 singleton_tig00055 Escherichia_coli_K12 96.9268 0 4386 4386 0 singleton_tig00056 Escherichia_coli_K12 97.6502 0 4332 4332 0 singleton_tig00057 Escherichia_coli_K12 97.4755 0 4173 4173 0 singleton_tig00058 Escherichia_coli_K12 97.8816 0 3840 3840 0 singleton_tig00059 Escherichia_coli_K12 97.7303 0 2975 2975 0 singleton_tig00060 Escherichia_coli_K12 96.2158 0 2397 2397 0 singleton_tig00061 Escherichia_coli_K12 97.5586 0 2040 2040 0 singleton_tig00062 Escherichia_coli_K12 97.5768 0 1651 1651 0 singleton_tig00063 Escherichia_coli_K12 97.8221 0 1617 1617 0 singleton_tig00064 Escherichia_coli_K12 97.032 0 1275 1275 0
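A minimal sketch of how one might scan that identity table programmatically. The 98% threshold and the seven-column layout (name, target, identity, start, end, length, unaligned bases) follow the awk output above; the sample rows are copied from the table:

```python
def low_identity_contigs(lines, threshold=98.0):
    """Parse rows of (name, target, identity, qstart, qend, qlen,
    unaligned_bases) and return contigs that fall below the identity
    threshold or that are not fully aligned."""
    flagged = []
    for line in lines:
        fields = line.split()
        name = fields[0]
        identity = float(fields[2])
        unaligned = int(fields[6])
        if identity < threshold or unaligned > 0:
            flagged.append((name, identity, unaligned))
    return flagged

# A few sample rows from the blasr/awk output above
table = [
    "singleton_tig00005 Escherichia_coli_K12 99.4192 0 8390 8390 0",
    "singleton_tig00012 Escherichia_coli_K12 96.5121 0 7558 7568 10",
    "singleton_tig00047 Escherichia_coli_K12 83.9108 0 5403 5403 0",
]
for name, identity, unaligned in low_identity_contigs(table):
    print("%s identity=%.4f unaligned=%d" % (name, identity, unaligned))
```

Running this over the full table flags the low-accuracy fragments and the three partial alignments mentioned above in one pass.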
The main contig is 4650011 bp. It should cover the whole genome. We can inspect the whole-genome alignment using
gepard
. We can see there is no larger-scale mis-assembly. Promising!! (The strain of K12 used for this data set is actually slightly different from the canonical one, so we do expect to see some small structural variations.)
Image(filename="tig_align.jpeg")
We don't expect high accuracy from the "draft" assembly. The way we construct the contig is simply to copy-and-paste from the pre-assembled reads, which have a mean identity of 99.787% to the reference genome. For a 4.7 Mb genome, we therefore expect about 4700000 * (1 - 0.99787) ~ 10k differences. One option is to do another round of consensus using the pre-assembled reads. Alternatively, we can align the raw reads to the draft contig and apply the "Quiver" algorithm to get the best accuracy from all the raw data. Let's see how "Quiver" helps to remove the errors in the draft contig.
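The back-of-the-envelope estimate works out as follows (a quick check using only the genome size and mean read identity quoted above):

```python
genome_size = 4700000      # approximate E. coli K12 genome size in bp
mean_identity = 0.99787    # mean identity of pre-assembled reads vs. reference

# Expected number of residual differences in a copy-and-paste draft:
# every base has a (1 - identity) chance of disagreeing with the reference.
expected_errors = genome_size * (1 - mean_identity)
print("expected differences: ~%.0f" % expected_errors)  # roughly 10k
```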
First, let's evaluate the difference between the draft contigs and the canonical reference with dnadiff
from
the MUMmer3 package
#!dnadiff ecoli_k12_MG1655.fasta tig_00000.fa -p diff_tig0
!echo the number of SNPs is `cat diff_tig0.snps | wc | awk '{print $1}'`
the number of SNPs is 10026
The draft contig has about 10k SNPs, as expected from the earlier estimate. We apply the Quiver
consensus algorithm to the draft contig using the following script:
#!/bin/bash
export SEYMOUR_HOME=/PathTo/PacBio/SMRTAnalysis
. $SEYMOUR_HOME/etc/setup.sh
cd /Path/To/WorkingDirectory
cp tig_00000.fa asm.ctg.fasta
referenceUploader -c -p $PWD -n assembly -f asm.ctg.fasta --skipIndexUpdate
compareSequences.py --info --useGuidedAlign --algorithm=blasr --nproc=24 --noXML --h5mode=w --h5fn=out.cmp.h5 --minAccuracy=0.70 --minLength=200 \
-x -nCandidates 50 -x -minMatch 12 -x -bestn 1 -x -minPctIdentity 70.0 /Path/To/WorkingDirectory/input.fofn assembly/
loadPulses /Path/To/WorkingDirectory/input.fofn out.cmp.h5 \
-metrics DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag -byread
cmph5tools.py sort out.cmp.h5 --tmp /scratch
variantCaller.py --algorithm quiver -j 16 --referenceFilename assembly/sequence/assembly.fasta \
--parameters best -o output.gff -o output.fasta -o output.fastq -q 0 -X 80 -x 5 --mapQvThreshold 0 out.cmp.h5
This will generate the consensus sequence using Quiver. The consensus output is in output.fasta
. We find 26 SNPs in the final consensus results. This gives a lower bound on the final assembly phred QV of about -10*log10(26/4639675) = 52.5. One can see that the Quiver
algorithm, which utilizes all the important information from the raw trace signal, is able to correct the ~10k errors in the
draft assembly. This gives us more freedom when constructing the initial draft assembly: we do not need to build a perfect draft from perfect reads. The goal of the draft is to get the contiguity right. Given the long reads, one can expect that most mappings are unambiguous, which leads to an almost perfect consensus using Quiver
.
#dnadiff ecoli_k12_MG1655.fasta output.fasta -p diff_quiver_tig0
!echo the number of SNPs is `cat diff_quiver_tig0.snps | wc | awk '{print $1}'`
from math import log
print "The phred scale QV of the assembly is about %0.1f" % (-10*log(26.0/4639675)/log(10))
print "The concordance is about %0.5f%%" % (100*(1-26.0/4639675))
the number of SNPs is 26 The phred scale QV of the assembly is about 52.5 The concordance is about 99.99944%
I had some fun with this exercise. I think this approach is general, but it will
certainly need more tests. We only have an n=1 success story for now. However,
the reality is that as we generate longer and longer reads, the assembly
problem becomes easier and easier. DNA sequencing is about connecting
different bases across distance, and some problems can only be solved with
long enough read spans. I would like to see more fundamental theory published about
genome assembly and haplotyping (e.g. the classical Lander-Waterman paper and
the recent careful analysis (preprint)
by David Tse's group). Getting a good consensus from long reads is
probably a simpler problem than dealing with the combinatorial explosion
when one tries to infer long-range information from short reads. Taking
bacterial assembly as an example, it is likely impossible to assemble the E.
coli K12 genome into a single contig with a single-end short-read library. Without
any long paired-end or jumping libraries, single-end short-read data simply
does not have the power to resolve the various scales of repeat structure in the genome.
Such constraints cannot be overcome by simply sequencing more (see Tse's argument).
Actually, regardless of how cheap DNA sequencing gets, without a proper understanding of
the fundamental theory and mathematical constraints, one could just waste
money collecting tons of useless data. Cheap data still costs money, and cheap data
that cannot solve the problem is actually expensive. One can cheaply sequence the E. coli K12 genome
to 10000x coverage with short-read technologies. However, without
proper libraries to provide long-range information, 10000x coverage of short fragments
will still lead to a fragmented assembly. On the other hand, as the example in this IPython
Notebook shows, once we can generate good-quality long reads,
even an assembly newbie like me can put together a prototype assembler in Python
in a few days. By the way, it takes only about 20 seconds on my 2012 MacBook Air (Intel i7) to
get the draft assembly from the m4
file.
!time python Write_An_Assembler.py > /dev/null
real 0m21.142s user 0m20.326s sys 0m0.745s