Write A Genome Assembler with blasr and (I)Python

Introduction

I am not sure it is the right thing to do some coding during a vacation. (See this tweet; I am not sure I would agree.) Anyway, before the vacation, I decided to organize the whole bunch of random papers I had collected on my laptop over the last few years. I eventually felt that I should read some of the theoretical papers about genome assembly again. I grabbed Gene Myers' paper and started to think about the problem of constructing unitigs (high-confidence contigs). Before trying to read the paper in detail, I felt it might be useful to write some quick code to check out what real data looked like. I started writing code for visualizing the local overlaps and generating the global overlap graph. It was pretty straightforward, and the visualization was quite inspiring. It motivated me to write more ad-hoc code in IPython during the vacation, going beyond generating simple visualizations. I started to implement a very simple greedy algorithm to connect the input DNA fragments. Eventually, I found I could actually get the whole genome assembly right!! This Notebook shows the work, step by step, toward a simple genome assembler for long read data using IPython and Python.

The input data was the E. coli K12 data from the PacBio(R) DevNet Site. I generated a new set of pre-assembled reads as the input DNA fragments using HBAR-DTK. The statistics of the input reads are shown below:

Pre-assembled Read Stats
#Seqs   17497
Min     448
1st Qu. 5365
Median  6149
Mean    6193
3rd Qu. 7189
Max     14762
Total   108363627
n50     6521
n90     4694
n95     3864

This is about 23x coverage of the E. coli K12 genome with reads of 6.2 kb average length. The mean sequence identity when aligning the pre-assembled reads to the canonical E. coli K12 reference is 99.787% (median identity = 99.921%). The pre-assembled read data is available for download from this link to pre_assembled_read.fa. According to the Lander-Waterman statistics, this set of reads is enough to get a single contig for the single chromosome. The only thing one needs to worry about is whether the repeats in the genome can be uniquely resolved. Since it has been shown that one can use Celera Assembler to assemble this data set into one single contig, all of the repeats of the rRNA operon can apparently be resolved from this data set. Typically, one might need multiple libraries, especially long jumping libraries, to resolve long repeats from short read (~100 bp) data. For example, ALLPATHS-LG can use multiple short-read libraries and long read data to get perfect assemblies for bacteria (Paper, ALLPATHS-LG Blog). However, when one can get long range information from one single library, the assembly process can probably be greatly simplified. As demonstrated here, one can actually get a pretty good assembly using Python code, with blasr for overlapping and Quiver for polishing the residual errors.
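
As a quick sanity check on the Lander-Waterman claim above, the sketch below plugs the read statistics into the classic island-counting formula. The genome size is the one used for the consensus evaluation later in this Notebook; the minimum detectable overlap T is an assumed value for illustration only and is not a parameter used anywhere else here. The expected number of disconnected islands comes out far below one, consistent with a single-contig assembly.

from math import exp

# back-of-the-envelope Lander-Waterman estimate (numbers from the read stats above)
N = 17497        # number of pre-assembled reads
L = 6193.0       # mean read length in bp
G = 4639675.0    # E. coli K12 genome size in bp
T = 500.0        # assumed minimum overlap length needed to detect an alignment
c = N * L / G    # coverage, about 23x
expected_islands = N * exp(-c * (1.0 - T / L))
print("coverage: %.1f, expected number of islands: %g" % (c, expected_islands))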

For long read data, the de Bruijn graph approach, with the sophisticated graph operations needed to get around the limitations of short reads, may not be the right choice. The Overlap-Layout-Consensus approach can be much more efficient for assembling long reads. Also, as the reads get longer, the overlaps between the reads become more and more specific. This significantly reduces the complexity of the overlap graph between the reads. (Here we define an overlap graph as a graph where the nodes are reads and the edges are constructed from the pairs of properly overlapped reads.) When the overlap graph is simple, it becomes possible to use a naive algorithm to get a reasonably good assembly. To make this clear, imagine the read lengths were as long as the genome: the number of overlaps one would need to worry about would be in the single digits and there would be no ambiguity at all. The overlap and layout steps become trivial in such an extreme case.

When I use Celera Assembler (CA) to assemble genomes, I always try to peek at the intermediate data in order to understand the process better and maybe fix a few things manually. Michael Schatz (@mike_schatz) and Adam Phillippy both pointed me to some very useful information about CA. (Thanks, Michael and Adam.) tigStore is a very useful command for seeing the relationship between the generated unitigs and the initial DNA fragments. The file 4-unitigger/best.edges contains the information about the best overlapped reads, which one can use to construct the overlapping graph, and one can learn a lot of subtle points about genome assembly by visualizing it. This notebook is definitely inspired by the great work done by CA's developers.

This IPython notebook is just a "Notebook". I wrote this code for fun and for (self-)educational purposes. It shows that, given good long read data and the proper tools (blasr, pbdagcon, HBAR-DTK, and quiver), one can learn how to do genome assembly with a "not-so-high-performance" language like Python. The great IPython Notebook allows me to try out different code snippets and different algorithmic ideas without a lot of overhead. One should note that the code in this IPython Notebook is mainly for fun, not for "production". I chose the path of least resistance for coding. Coding style and code performance were considered but not taken seriously, as I was doing a lot of experiments with different algorithmic approaches when I wrote this notebook.

I consider this Notebook a personal project for fun. The data and the software used can all be downloaded publicly. Although I am an employee of PacBio(R), this notebook was developed in my own time. PacBio is not responsible for any information associated with this Notebook.

--Jason Chin, Apr 2, 2013

Get Data

I put the pre-assembled read file on Figshare. Here is the link to pre_assembled_read.fa. For most parts of the Notebook, one does not need the fasta file. You can also visit the project data page to see the other data files and this code.

Overlapping

The first step to assemble the genome is to use blasr to find the overlaps between the reads. This can be done by executing this command:

blasr pre_assembled_reads.fa pre_assembled_reads.fa -m 4 -nproc 32 -bestn 32 -maxLCPLength 15 -nCandidates 36 -out pr_pr_bn32.m4 -noSplitSubreads

We ask blasr to output the coordinates of the alignments between the reads by specifying the -m 4 option. The -nproc 32 option lets blasr use 32 threads. For each read, we allow blasr to find the 32 best overlaps by using -bestn 32, scanning 36 potential hit candidates with -nCandidates 36. The -noSplitSubreads option has only a cosmetic effect here: it makes the output query identifiers cleaner.

blasr is a general purpose aligner. It is not really tuned for overlapping work. One will see that it sometimes does not handle the ends of the alignments as one would like. The containment_tolerance variable used below was introduced to be less picky at the ends of the alignments. However, one needs to worry about potential mis-assemblies if the tolerance is too big.
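
To make the role of the tolerance concrete, here is a minimal sketch of the containment test used in get_overlap_data further below: a read is treated as contained in its alignment partner if the part of the read not covered by the alignment is shorter than containment_tolerance bases, even though a few bases at the ends remain unaligned. The is_contained helper is only for illustration.

def is_contained(aln_start, aln_end, read_len, containment_tolerance=50):
    # the unaligned portion of the read is read_len - (aln_end - aln_start)
    return read_len - (aln_end - aln_start) < containment_tolerance

# a 2899 bp read aligned over 0..2899 is contained; aligned over only 100..2899 it is not
print(is_contained(0, 2899, 2899), is_contained(100, 2899, 2899))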

The aligned results can be found in the FigShare project page mentioned above. Or, one can download it from this link.

"blasr -m 4" output format

The cells below show the content of the m4 file. It contains the alignment information between the reads. The fields are: query identifier, target identifier, alignment score (more negative -> better alignment), alignment identity, query strand, query start, query end, query length, target strand, target start, target end, target length. The rest of the fields are not used. One trap here is that the "target start" and "target end" are not relative to the original sequence when "target strand" is 1. When "target strand" = 1, it means the reverse-complement strand is aligned, and the "target start" and "target end" are the coordinates on the reverse-complement strand. In the standard blasr output, the "query strand" is always 0.

In [1]:
!head -10 pr_pr_bn32.m4
0xc4b462ba_0096978 0xc4b462ba_0096978 -14495 100 0 0 2899 2899 0 0 2899 2899 50 60847 -2716.48 0 21
0xc4b462ba_0096978 0x79c77458_0006261 -14400 99.4505 0 0 2899 2899 1 2502 5411 10399 50 60890 -2683.04 0 21
0xc4b462ba_0096978 0x72a8c4f1_0018952 -14395 99.4164 0 0 2899 2899 0 483 3393 5755 50 60893 -2658.33 0 21
0xc4b462ba_0096978 0x4dd7c9d5_0133333 -14390 99.4162 0 0 2899 2899 1 2718 5626 6023 50 60894 -2683.04 0 21
0xc4b462ba_0096978 0x726e8550_0094067 -14385 99.3821 0 0 2899 2899 0 27 2936 7163 50 60897 -2658.76 0 21
0xc4b462ba_0096978 0x833174a6_0078354 -14385 99.3821 0 0 2899 2899 1 2676 5585 6916 50 60897 -2683.28 0 21
0xc4b462ba_0096978 0xc52dd432_0103092 -14380 99.3819 0 0 2899 2899 1 2565 5472 6025 50 60900 -2666.7 0 21
0xc4b462ba_0096978 0x48e2fc97_0054635 -14375 99.3477 0 0 2899 2899 0 2621 5529 7539 50 60902 -2658.24 0 21
0xc4b462ba_0096978 0xea0a0bab_0128842 -14375 99.3477 0 0 2899 2899 0 7684 10592 11923 50 60901 -2657.5 0 21
0xc4b462ba_0096978 0x4949e7c9_0060870 -14360 99.3132 0 0 2899 2899 1 2927 5832 6621 50 60907 -2666.12 0 21
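
As a small illustration of the coordinate trap described above, the sketch below parses one of the m4 lines shown here and, when the target strand is 1, maps the target coordinates back onto the forward strand of the target read (similar to the transformation used when symmetrizing the records in get_overlap_data in the next section). The parse_m4_line helper is only for illustration and is not used elsewhere in this Notebook.

def parse_m4_line(line):
    # keep only the fields used in this Notebook; see the field list above
    f = line.strip().split()
    q_name, t_name = f[0], f[1]
    aln_score, aln_idt = int(f[2]), float(f[3])
    q_strand, q_start, q_end, q_len = (int(x) for x in f[4:8])
    t_strand, t_start, t_end, t_len = (int(x) for x in f[8:12])
    if t_strand == 1:
        # blasr reports these coordinates on the reverse-complement strand,
        # so flip them back onto the forward strand of the target read
        t_start, t_end = t_len - t_end, t_len - t_start
    return (q_name, t_name, aln_score, aln_idt,
            (q_strand, q_start, q_end, q_len),
            (t_strand, t_start, t_end, t_len))

example = ("0xc4b462ba_0096978 0x79c77458_0006261 -14400 99.4505 "
           "0 0 2899 2899 1 2502 5411 10399 50 60890 -2683.04 0 21")
print(parse_m4_line(example))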

Convert blasr's output to python internal data

We define a read_node class for the objects that hold the local overlap information of each read. It simply holds a Python dictionary (hash map) where the key is the target identifier and the value is the m4 line with its elements converted to the right types. Only "non-contained" alignments are stored: if one read is fully aligned within another read, such an alignment is not stored in a read_node. The get_overlap_data function processes the m4 file and determines whether the alignments indicate containment, with some tolerance at the ends of the alignments.

In [2]:
import sys
import os
class read_node_0:  #experimenting on keeping the data in a list rather than a dictionary, not used

    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.hits = [] 
        self.hit_map = None
        
    def add_hit(self, aln_info):
        self.hits.append ( aln_info )

    def __getitem__(self, target_name):
        if self.hit_map == None:
            self.hit_map = dict(zip( [h[0] for h in self.hits], self.hits ) )
        return self.hit_map[target_name]

class read_node:

    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.hit_map = {}
        
    def add_hit(self, aln_info):
        self.hit_map[aln_info[0]] = aln_info

    def __getitem__(self, target_name):
        return self.hit_map[target_name]
    
    @property
    def hits(self):  #another convenient representation of the same data
        return self.hit_map.values()


def get_overlap_data(m4_filename):
    containment_tolerance = 50 
    permitted_error_pct = 2
    overlap_data = {}
    contained_reads = set()
    with open(m4_filename) as m4_f:
        for l in m4_f:
            l = l.strip().split()
            q_name, t_name =l[0:2]
            if q_name == t_name:
                continue
            aln_score = int(l[2])
            aln_idt = float(l[3])
            
            if aln_idt < 100-permitted_error_pct:
                continue
    
            q_strand, q_start, q_end, q_len = ( int(x) for x in l[4:8])
            t_strand, t_start, t_end, t_len = ( int(x) for x in l[8:12])
    
            if q_len - (q_end - q_start) < containment_tolerance:
                contained_reads.add(q_name)
    
            if t_len - (t_end - t_start) < containment_tolerance:
                contained_reads.add(t_name)
    
            if q_name not in overlap_data:
                overlap_data[ q_name ] = read_node(q_name, q_len)
    
            assert q_strand == 0
            if t_name not in [x[0] for x in overlap_data[ q_name ].hits]:
                overlap_data[ q_name ].add_hit( (t_name, 
                                                 aln_score, 
                                                 aln_idt, 
                                                 (q_strand, q_start, q_end, q_len),
                                                 (t_strand, t_start, t_end, t_len) ) )
    
            #symmetrized the alignment record
            if t_name not in overlap_data:
                overlap_data[ t_name ] = read_node(t_name, t_len)
    
            if q_name not in [x[0] for x in overlap_data[ t_name ].hits]:
                if t_strand == 1: 
                    t_start, t_end = t_len - t_end, t_len - t_start
                    q_start, q_end = q_len - q_end, q_len - q_start
                    overlap_data[ t_name ].add_hit( (q_name, 
                                                 aln_score, 
                                                 aln_idt, 
                                                 (0, t_start, t_end, t_len),
                                                 (1, q_start, q_end, q_len) ) )
                else:
                    overlap_data[ t_name ].add_hit( (q_name, 
                                                 aln_score, 
                                                 aln_idt, 
                                                 (0, t_start, t_end, t_len),
                                                 (0, q_start, q_end, q_len) ) )
    return overlap_data, contained_reads
In [3]:
overlap_data, contained_reads = get_overlap_data("pr_pr_bn32.m4")

A simple function to convert the overlapping data to a GML file

The generate_overlap_gml function takes the overlap_data and the contained_reads and generates an overlap graph in GML format. (You will need to install networkx in your Python environment to use this code.)

In [4]:
def generate_overlap_gml(overlap_data, contained_reads, gml_filename):
    containment_tolerance = 50 
    permitted_error_pct = 2
    import networkx as nx
    G=nx.DiGraph()
    node_in_graph = set()
    for q_name in overlap_data:
        if q_name in contained_reads:
            continue
        if q_name not in node_in_graph:
            G.add_node(q_name)
        targets = overlap_data[ q_name ].hits
        targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and h[0] not in contained_reads]
        targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and h[0] not in contained_reads]
        targets_3prime.sort(key = lambda k:k[1])
        targets_5prime.sort(key = lambda k:k[1])

        if len(targets_3prime) > 0:
            t = targets_3prime[0]
            t_name = t[0]
            if t_name not in node_in_graph:
                G.add_node(t_name)
            G.add_edge(q_name, t_name)

        if len(targets_5prime) > 0:
            t = targets_5prime[0]
            t_name = t[0]
            if t_name not in node_in_graph:
                G.add_node(t_name)
            G.add_edge(q_name, t_name)
    nx.write_gml(G, gml_filename)
In [5]:
generate_overlap_gml(overlap_data, contained_reads, "overlap_graph.gml")

Overlapping graph visualized by Gephi

The generated overlap_graph.gml can be visualized with Gephi. One can use the "Yifan Hu's Multilevel" graph layout algorithm to do the rough layout, followed by the "ForceAtlas 2" algorithm to smooth the graph. The cell below shows the global overlap structure. One can see that it is topologically a circle plus a number of singletons. Namely, the read data does capture the whole genome in one single circular layout.

In [6]:
from IPython.display import Image
Image(filename = "overlap_graph01.png")
Out[6]:

We can also see some local "bubbles" in the graph.

In [7]:
Image(filename = "overlap_graph02.png")
Out[7]:
In [8]:
Image(filename = "overlap_graph03.png")
Out[8]:

Some utility constants, functions and a class for viewing the local overlaps with SVG files

The cell below defines some constants and utility functions so we can generate SVG showing the local overlap alignments.

In [9]:
arrow_defs = """<defs>
    <marker id="Triangle_green" stroke="green" fill="green"
      viewBox="0 0 10 10" refX="0" refY="5"
      markerUnits="strokeWidth"
      markerWidth="4" markerHeight="3"
      orient="auto">
      <path d="M 0 0 L 10 5 L 0 10 z" />
    </marker>
    <marker id="Triangle_blue" stroke="blue" fill="blue"
      viewBox="0 0 10 10" refX="0" refY="5"
      markerUnits="strokeWidth"
      markerWidth="4" markerHeight="3"
      orient="auto">
      <path d="M 0 0 L 10 5 L 0 10 z" />
    </marker>
    <marker id="Triangle_black" stroke="black" fill="black"
      viewBox="0 0 10 10" refX="0" refY="5"
      markerUnits="strokeWidth"
      markerWidth="4" markerHeight="3"
      orient="auto">
      <path d="M 0 0 L 10 5 L 0 10 z" />
    </marker>
    <marker id="Triangle_red" stroke="red" fill="red"
      viewBox="0 0 10 10" refX="0" refY="5"
      markerUnits="strokeWidth"
      markerWidth="4" markerHeight="3"
      orient="auto">
      <path d="M 0 0 L 10 5 L 0 10 z" />
    </marker>
  </defs>"""

def svg_arrow( x1, y1, x2, y2, col, w):
    return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f" marker-end="url(#Triangle_%s)"/> </g>""" % (col, w, x1, y1, x2, y2, col)

def svg_line( x1, y1, x2, y2, col, w):
    return """<g stroke = "%s" stroke-width = "%f" fill = "none">\
<path d = "M %f %f L %f %f"/> </g>""" % (col, w, x1, y1, x2, y2)

class Overlap_SVG_view:
    def __init__(self):
        self.header = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns:svg="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" version="1.1" width="900px" height="300px">
"""
        self.footer = """</svg>"""
        self.elements = []
        self._svg_data = None
        
    def add_element(self, elm):
        self.elements.append(elm)
        
    def _construct_SVG(self):
        SVG_str = []
        SVG_str.append( self.header )
        for elm in self.elements:
            SVG_str.append( elm )
        SVG_str.append( self.footer )
        self._svg_data = "\n".join(SVG_str)
        self._svg_data = self._svg_data.decode('utf-8')

    def output(self, filename):
        if self._svg_data is None:
            self._construct_SVG()
        with open(filename, "w") as out_f:
            print >> out_f, self._svg_data
            
    def _repr_svg_(self):
        if self._svg_data is None:
            self._construct_SVG()
        return self._svg_data
        
    @property
    def svg(self):
        return SVG(self._repr_svg_())
            

generate_overlap_view: a function taking the overlapping data to generate an SVG file

The generate_overlap_view function simply does several coordinate transformations to create an SVG file so we can see the local overlaps around a given DNA fragment. The black arrow indicates the fragment of interest. Green and blue arrows indicate the forward and reverse proper overlap alignments. The red arrows indicate containment alignments or improper alignments (locally inconsistent alignments due to repeats, mis-mapping or sequencing artifacts).

In [10]:
from IPython.display import SVG

import os

def generate_overlap_view(overlap_data, q_name):
    containment_tolerance = 10
    
    hits = overlap_data[q_name].hits
    SV = Overlap_SVG_view()
    SV.add_element(arrow_defs)
    x_offset = 10000
    x_scaling = 0.02
    x1 = 0
    x2 = overlap_data[q_name].length
    x1 += x_offset 
    x2 += x_offset
    x1 *= x_scaling
    x2 *= x_scaling
    y = 200
    SV.add_element(svg_arrow(x1, y, x2, y, "black", 4))
    hits.sort(key = lambda k: (k[3][1],k[3][3],k[3][2]))
    for h in hits:
        t_name, t_score, t_idt = h[0], h[1], h[2]
        if t_name in contained_reads:
            continue
        q_strand, q_start, q_end, q_len = h[3]
        t_strand, t_start, t_end, t_len = h[4]
        y -=10
        x1 = q_start
        x2 = q_end
        col = None
        
        if q_start < containment_tolerance and t_len - t_end < containment_tolerance:
            x1 = -t_start
        elif q_len - q_end < containment_tolerance and t_start < containment_tolerance:
            x2 = q_end + (t_len - t_end)
        else:
            col = "red"
            if t_strand == 1:
                x1, x2 = x2, x1
            print h
        if t_strand == 0 and col == None:
            col = "green"
        elif col == None:
            col = "blue"
            x1, x2 = x2, x1
        #print h
        #print min(x1, x2), max(x1, x2), abs(x1-x2)
        x1 += x_offset
        x2 += x_offset
        x1 *= x_scaling
        x2 *= x_scaling
        SV.add_element(svg_arrow(x1, y, x2, y, col, 3))
    return SV
    
q_name = overlap_data.keys()[0]
OSV = generate_overlap_view(overlap_data, q_name)
display(OSV.svg)
('0x9ba164ba_0123315', -32790, 99.8631, (0, 0, 6571, 6571), (0, 38, 6610, 7974))

Generate the "best overlap graph"

Given that we see a beautiful circular overlap graph, we would like to see if we can choose a seeding fragment and extend the overlaps to form contigs. The code below scans through the overlap_data and assigns the best-scoring alignments at both the 3' and 5' ends of a read as the best partners. get_best_overlap_graph returns a dictionary where the key is a fragment identifier and the value holds the alignment information of the best 3' and 5' overlap alignments.

In [11]:
def filter_overlap_hit(overlap_hit):
    containment_tolerance = 50
    q_strand, q_start, q_end, q_len = overlap_hit[3]
    t_strand, t_start, t_end, t_len = overlap_hit[4]
    if q_len - (q_end - q_start) < containment_tolerance and  t_len - (t_end - t_start) / t_len < containment_tolerance:
        return False
    elif overlap_hit[0] in contained_reads:
        return False
    else:
        return True

def get_best_overlap_graph(overlap_data):
    best_overlap_graph = {}
    containment_tolerance = 50
    
    for q_name in overlap_data:
        if q_name in contained_reads:
            continue
        
        if q_name not in best_overlap_graph:
            best_overlap_graph[q_name] = {"5p":None, "3p":None}
        
        
        targets = overlap_data[ q_name ].hits
        
        targets_5prime = [ h for h in targets if h[3][1] < containment_tolerance and filter_overlap_hit(h)]
        targets_3prime = [ h for h in targets if h[4][1] < containment_tolerance and filter_overlap_hit(h)]
    
        targets_5prime.sort(key = lambda k:k[1])
        targets_3prime.sort(key = lambda k:k[1])
    
        if len(targets_5prime) > 0:
            best_overlap_graph[q_name]["5p"] = targets_5prime[0]
            t_name = targets_5prime[0][0]
    
                
        if len(targets_3prime) > 0:
            best_overlap_graph[q_name]["3p"] = targets_3prime[0]
            t_name = targets_3prime[0][0]
    
    return best_overlap_graph

Check the output of get_best_overlap_graph

The cell below finds the best alignments for the read given by the first key in the best_overlap_graph. It shows the best 3' and 5' partners.

In [12]:
best_overlap_graph = get_best_overlap_graph(overlap_data)
q_name = best_overlap_graph.keys()[0]
print q_name, "has best partners:", best_overlap_graph[q_name]
0xf6bb6a22_0085668 has best partners: {'3p': ('0xdcf35477_0042197', -37729, 99.9206, (0, 1032, 8586, 8586), (1, 0, 7557, 9323)), '5p': ('0xd9132340_0068296', -40160, 99.373, (0, 0, 8128, 8586), (1, 436, 8525, 8525))}

Find a best overlapping path from a seeding fragment

The two functions defined below allow us to extend the overlaps through a chain of overlapped fragments. The extension is terminated when (1) some inconsistent overlap is detected, (2) the overlapped fragment is already used by another contig, (3) no overlapped fragment is found, or (4) a circle is detected in the path. Such rules may not be the best way to generate unitigs, but they are very simple to implement. At least, these rules work well for this E. coli data set. One can imagine more complicated scenarios in which it will be necessary to refine these rules or add more rules to get the best results and avoid mis-assemblies.

In [13]:
def proper_overlap_hit(overlap_hit):
    containment_tolerance = 50
    q_strand, q_start, q_end, q_len = overlap_hit[3]
    t_strand, t_start, t_end, t_len = overlap_hit[4]
    if q_start < containment_tolerance:
        if t_len - t_end > containment_tolerance:
            return False
        else:
            return True
    if t_start < containment_tolerance:
        if q_len - q_end > containment_tolerance:
            return False
        else:
            return True
        
def find_path( q_name_end, frag_used = set() ):
    reverse_end = {"3p":"5p", "5p":"3p"}
    path = []
    path_q_name = set()
    q_name, end = q_name_end 
    if end == "5p":
        reversing_path = True
    else:
        reversing_path = False
        
    while 1:
        if q_name in frag_used:
            if reversing_path:
                path.reverse()
            return path, "frag_used"
        
        path.append( (q_name, end) )
        path_q_name.add(q_name)
        if q_name not in best_overlap_graph:
            if reversing_path:
                path.reverse()
            return path, "terminate_1" 
        next_hit = best_overlap_graph[q_name][end]
        #print next_hit
        if next_hit == None:
            if reversing_path:
                path.reverse()
            return path, "terminate_2"
        
        if next_hit[0] in best_overlap_graph: #if not mutual good hits, break the path
            
            # Using mutual best hit might be to strigent, 
            # bh = []
            #if best_overlap_graph[next_hit[0]]["5p"]:
            #    bh.append( best_overlap_graph[next_hit[0]]["5p"][0] )
            #if best_overlap_graph[next_hit[0]]["3p"]:
            #    bh.append( best_overlap_graph[next_hit[0]]["3p"][0] )
            
            bh = [h[0] for h in overlap_data[next_hit[0]].hits if proper_overlap_hit(h)]
       
            if q_name not in bh:
                if reversing_path:
                    path.reverse()
                return path, "branch"
        
        q_name = next_hit[0]
        
        if q_name in path_q_name:
            if reversing_path:
                path.reverse()
            return path, "circle"
        
        if next_hit[4][0] == 1: #reversed strand
            end = reverse_end[end]

Examples of the best overlapping paths

Here we show how to call the find_path function to get the overlapping paths toward both the 5' and 3' ends from a seed fragment. We pick the longest fragment in the data set as the seed. One will notice that the total number of fragments from both the 5'-path and the 3'-path is greater than the total number of fragments in the best_overlap_graph. This means there is significant overlap between the 5'-path and the 3'-path, which is consistent with the genome being circular.

In [14]:
len_qname = [ (overlap_data[x].length, x) for x in best_overlap_graph.keys() ]
len_qname.sort()
q_name = len_qname[-1][1]
#q_name = "0xa4919d66_0061381"
print "The longest fragment is", q_name

path_5p, s_5p = find_path( (q_name, "5p") )
path_3p, s_3p = find_path( (q_name, "3p") )
print "The number of the fragment of the 5' path is", len(path_5p)
print "The number of the fragment of the 3' path is", len(path_3p)
print "The total number of framgent of both 3' and 5' path is %d." % (len(path_3p)+len(path_5p)-1)
print "%d is greater than the total number of fragments %d in the best_overlap_graph." % (len(path_3p)+len(path_5p)-1, len(best_overlap_graph))
print "The begin of the 5'-path is", path_5p[0]
print "The end of the 3'-path is", path_3p[-1]
The longest fragment is 0x4eaa5fa7_0001083
The number of fragments in the 5' path is 2221
The number of fragments in the 3' path is 2158
The total number of fragments in both the 3' and 5' paths is 4378.
4378 is greater than the total number of fragments 3390 in the best_overlap_graph.
The beginning of the 5'-path is ('0xa4919d66_0061381', '3p')
The end of the 3'-path is ('0x24a73353_0082329', '3p')

Build a sequence database

Since we now have paths that seem to cover the whole genome, we can use them to construct draft contigs. In order to do that, we need the fragment sequences. (The simple parser below assumes each sequence occupies a single line in the fasta file.)

In [15]:
seq_db = {}
with open("pre_assembled_reads.fa") as seq_file:
    for l in seq_file:
        l = l.strip()
        if l[0] == ">":
            name = l[1:]
            continue
        else:
            seq_db[name] = l

Build a "draft" contig using the best overlapping paths

The layout_path function goes through a layout path and fills in the bases from the fragment sequences. Nothing is really special here; however, it is always tricky to track the right orientation of the fragments. The function also tracks the fragments used in the layout; each fragment can only be assigned to one layout/contig.
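
Before looking at the full function, here is a minimal sketch (the layout_coordinates helper is only for illustration and not used elsewhere in this Notebook) that walks a path and only tracks where each fragment would start in the contig and in which orientation, using the same offset and orientation bookkeeping as layout_path:

def layout_coordinates(path, overlap_data):
    # path is a list of (fragment_name, end) tuples as returned by find_path
    reverse_orientation = {"5p": "3p", "3p": "5p"}
    coords = []
    offset = 0
    first = path[0]
    orientation = first[1]  # "3p" means the fragment is laid out reverse-complemented
    for second in path[1:]:
        coords.append((first[0], offset, orientation))
        overlap_hit = overlap_data[first[0]][second[0]]
        q_strand, q_start, q_end, q_len = overlap_hit[3]
        t_strand, t_start, t_end, t_len = overlap_hit[4]
        # advance the offset to where the next fragment will be placed in the contig
        offset += q_start if orientation == "5p" else q_len - q_end
        if t_strand == 1:  # a reverse strand alignment flips the orientation
            orientation = reverse_orientation[orientation]
        first = second
    coords.append((first[0], offset, orientation))
    return coords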

In [16]:
def rc_seq(seq):
    rev_map = dict(zip("acgtACGTNn-","tgcaTGCANn-"))
    return "".join([rev_map[c] for c in seq[::-1]])

def layout_path(full_path, frag_used, out_fn, tig_name):
    
    if len(full_path) == 0 or full_path[0][0] in frag_used:
        return None
    
    if len(full_path) == 1:
        with open("singleton_"+out_fn, "w") as out_seq_file:
            seq = seq_db[full_path[0][0]]
            print >>out_seq_file, ">%s" % ( "singleton_" + tig_name  )
            print >>out_seq_file, seq
            frag_used.add(full_path[0][0])
        return None
    
    first = full_path[0]
    offset = 0
    current_orientation = first[1]
    revserse_orientation = {"5p":"3p", "3p":"5p"}
    tig_seq = []
    frag_in_layout = set()
    frag_in_layout.add(first[0])
    for second in full_path[1:]:
        overlap_hit = overlap_data[first[0]][second[0]]
        
        q_strand, q_start, q_end, q_len = overlap_hit[3]
        t_strand, t_start, t_end, t_len = overlap_hit[4]
        #print overlap_hit, offset, current_orientation
        seq = seq_db[first[0]]
        if current_orientation == "3p":
            seq = rc_seq(seq)
        del tig_seq[offset:-1]
        tig_seq.extend(seq)
        
        if current_orientation == "5p":
            offset += q_start
        elif current_orientation == "3p":
            offset += q_len - q_end
        if t_strand == 1:
           current_orientation  = revserse_orientation[current_orientation] 
        frag_in_layout.add(first[0])
        first = second
        if second[0] in frag_in_layout:
            break
        if second[0] in frag_used:
            break

    seq = seq_db[first[0]]
    if current_orientation == "3p":
        seq = rc_seq(seq)
    del tig_seq[offset:-1]
    tig_seq.extend(seq)
    frag_in_layout.add(first[0])
    
    frag_used.update( frag_in_layout )
    
    tig_seq = tig_seq[0:offset+len(seq)]
    tig_seq = "".join(tig_seq)
    with open(out_fn, "w") as out_seq_file:
        print >>out_seq_file, ">%s" % tig_name
        print >>out_seq_file, tig_seq

Loop through all fragments to generate all contigs and singletons

OK, ready for prime time!!

The code below picks the longest fragment as the first seed and generates a contig from its overlapping path. Then we pick the longest fragment among the reads that have not yet been used in a contig, and repeat. This process continues until no more fragments are available.

For this data set, we generate one big contig (4650011 bp) and 64 singleton reads.

In [17]:
frag_used = set()

all_best_overlap_frags = set(best_overlap_graph.keys())

unused_frag = all_best_overlap_frags - frag_used
i = 0
while len(unused_frag) != 0:

    len_qname = [ (overlap_data[x].length, x) for x in unused_frag  ]
    len_qname.sort()
    q_name = len_qname[-1][1]
    print "iteration: ",i
    print "frag is used?", q_name in frag_used
    if q_name in frag_used:
        continue

    path_5p, s_5p = find_path( (q_name, "5p"), frag_used  )
    path_3p, s_3p = find_path( (q_name, "3p"), frag_used  )
    print "seeding frag:", q_name, "Length:", overlap_data[q_name].length
    print "number of unused frag:", len(unused_frag), "total overlapped frag:", len(path_3p)+len(path_5p)
    print len(path_5p), s_5p, len(path_3p), s_3p
    print "--------------------------"
    #print path_5p[0], path_3p[-1]
    if len(path_5p) + len(path_3p) == 0:
        out_fn = "tig_%05d.fa" % i
        tig_name ="tig%05d" % i
        with open("singleton_"+out_fn, "w") as out_seq_file:
            seq = seq_db[q_name]
            print >>out_seq_file, ">%s" % ( "singleton_" + tig_name  )
            print >>out_seq_file, seq
        frag_used.add(q_name)
        unused_frag = all_best_overlap_frags - frag_used
        i += 1
        continue
        
    if len(path_5p) > 0:
        #assert path_5p[-1] == path_3p[0]

        full_path = path_5p + path_3p[1:]
    else:
        full_path = path_3p
    layout_path(full_path, frag_used, "tig_%05d.fa" % i, "tig%05d" % i)
    unused_frag = all_best_overlap_frags - frag_used
    i += 1
iteration:  0
frag is used? False
seeding frag: 0x4eaa5fa7_0001083 Length: 14762
number of unused frag: 3390 total overlapped frag: 4379
2221 terminate_2 2158 terminate_2
--------------------------
iteration:  1
frag is used? False
seeding frag: 0x720a6867_0101011 Length: 9562
number of unused frag: 64 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  2
frag is used? False
seeding frag: 0xbb3ff171_0040732 Length: 8653
number of unused frag: 63 total overlapped frag: 2
1 branch 1 branch
--------------------------
iteration:  3
frag is used? False
seeding frag: 0xf9f76d57_0075336 Length: 8601
number of unused frag: 62 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  4
frag is used? False
seeding frag: 0x24a73353_0082329 Length: 8529
number of unused frag: 61 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  5
frag is used? False
seeding frag: 0x99a801d6_0119226 Length: 8390
number of unused frag: 60 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  6
frag is used? False
seeding frag: 0x3176999f_0092788 Length: 8012
number of unused frag: 59 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  7
frag is used? False
seeding frag: 0xd1f5cb9c_0041132 Length: 7742
number of unused frag: 58 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  8
frag is used? False
seeding frag: 0x6a67d4d1_0054113 Length: 7635
number of unused frag: 57 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  9
frag is used? False
seeding frag: 0x5e62b551_0087810 Length: 7612
number of unused frag: 56 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  10
frag is used? False
seeding frag: 0x47c817fa_0037671 Length: 7592
number of unused frag: 55 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  11
frag is used? False
seeding frag: 0xd34bbcb8_0096026 Length: 7586
number of unused frag: 54 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  12
frag is used? False
seeding frag: 0xde53c97d_0130793 Length: 7568
number of unused frag: 53 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  13
frag is used? False
seeding frag: 0xbab42971_0019685 Length: 7462
number of unused frag: 52 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  14
frag is used? False
seeding frag: 0x95dcccbf_0021944 Length: 7446
number of unused frag: 51 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  15
frag is used? False
seeding frag: 0x7b49a930_0025802 Length: 7402
number of unused frag: 50 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  16
frag is used? False
seeding frag: 0xec2de5ef_0014019 Length: 7335
number of unused frag: 49 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  17
frag is used? False
seeding frag: 0xb48bdbdf_0062004 Length: 7033
number of unused frag: 48 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  18
frag is used? False
seeding frag: 0x201a3501_0131298 Length: 6986
number of unused frag: 47 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  19
frag is used? False
seeding frag: 0xa3d2663c_0086930 Length: 6961
number of unused frag: 46 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  20
frag is used? False
seeding frag: 0xe092bfc0_0025000 Length: 6941
number of unused frag: 45 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  21
frag is used? False
seeding frag: 0x101cf700_0133174 Length: 6913
number of unused frag: 44 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  22
frag is used? False
seeding frag: 0x27b1b79d_0098081 Length: 6687
number of unused frag: 43 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  23
frag is used? False
seeding frag: 0x746a0885_0091093 Length: 6663
number of unused frag: 42 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  24
frag is used? False
seeding frag: 0xf16e1447_0062854 Length: 6502
number of unused frag: 41 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  25
frag is used? False
seeding frag: 0x2e1dde37_0073961 Length: 6444
number of unused frag: 40 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  26
frag is used? False
seeding frag: 0x39af8897_0112579 Length: 6439
number of unused frag: 39 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  27
frag is used? False
seeding frag: 0x4252156_0076315 Length: 6326
number of unused frag: 38 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  28
frag is used? False
seeding frag: 0x8b32b9de_0082459 Length: 6255
number of unused frag: 37 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  29
frag is used? False
seeding frag: 0xc0298763_0082661 Length: 6235
number of unused frag: 36 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  30
frag is used? False
seeding frag: 0xfe2ddc8f_0006613 Length: 6219
number of unused frag: 35 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  31
frag is used? False
seeding frag: 0xf2c9d7d9_0019647 Length: 6209
number of unused frag: 34 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  32
frag is used? False
seeding frag: 0x5f497891_0044312 Length: 6178
number of unused frag: 33 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  33
frag is used? False
seeding frag: 0x9d1fd3d5_0003951 Length: 6142
number of unused frag: 32 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  34
frag is used? False
seeding frag: 0xd3405c1f_0120478 Length: 6056
number of unused frag: 31 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  35
frag is used? False
seeding frag: 0xd84402bc_0135092 Length: 6022
number of unused frag: 30 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  36
frag is used? False
seeding frag: 0x8e9f3eb7_0077492 Length: 6020
number of unused frag: 29 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  37
frag is used? False
seeding frag: 0x6560473d_0129910 Length: 6009
number of unused frag: 28 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  38
frag is used? False
seeding frag: 0x8e21ba01_0028034 Length: 5945
number of unused frag: 27 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  39
frag is used? False
seeding frag: 0x328d5dfc_0076264 Length: 5867
number of unused frag: 26 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  40
frag is used? False
seeding frag: 0x8c4c9750_0115911 Length: 5854
number of unused frag: 25 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  41
frag is used? False
seeding frag: 0x99d966a0_0043182 Length: 5848
number of unused frag: 24 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  42
frag is used? False
seeding frag: 0xebaae05c_0001651 Length: 5718
number of unused frag: 23 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  43
frag is used? False
seeding frag: 0x6015c8f3_0137444 Length: 5607
number of unused frag: 22 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  44
frag is used? False
seeding frag: 0x820c3016_0097966 Length: 5594
number of unused frag: 21 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  45
frag is used? False
seeding frag: 0xfb73db8e_0019839 Length: 5523
number of unused frag: 20 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  46
frag is used? False
seeding frag: 0x5842b8b1_0141185 Length: 5483
number of unused frag: 19 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  47
frag is used? False
seeding frag: 0x2876a699_0009758 Length: 5403
number of unused frag: 18 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  48
frag is used? False
seeding frag: 0xf936c2fd_0094203 Length: 5372
number of unused frag: 17 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  49
frag is used? False
seeding frag: 0xe5322dba_0130928 Length: 5027
number of unused frag: 16 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  50
frag is used? False
seeding frag: 0x2ac1140b_0130957 Length: 4894
number of unused frag: 15 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  51
frag is used? False
seeding frag: 0x53d92d4b_0116355 Length: 4758
number of unused frag: 14 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  52
frag is used? False
seeding frag: 0xe09961a7_0000768 Length: 4746
number of unused frag: 13 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  53
frag is used? False
seeding frag: 0x6bdaa7bf_0024191 Length: 4530
number of unused frag: 12 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  54
frag is used? False
seeding frag: 0x2d6f1fdb_0064584 Length: 4419
number of unused frag: 11 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  55
frag is used? False
seeding frag: 0xe6c70a1b_0009216 Length: 4386
number of unused frag: 10 total overlapped frag: 2
1 frag_used 1 frag_used
--------------------------
iteration:  56
frag is used? False
seeding frag: 0x1bed69cc_0052813 Length: 4332
number of unused frag: 9 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  57
frag is used? False
seeding frag: 0xc48ec0c3_0089324 Length: 4173
number of unused frag: 8 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  58
frag is used? False
seeding frag: 0x2afb8806_0097596 Length: 3840
number of unused frag: 7 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  59
frag is used? False
seeding frag: 0xf953ded0_0012908 Length: 2975
number of unused frag: 6 total overlapped frag: 2
1 frag_used 1 terminate_2
--------------------------
iteration:  60
frag is used? False
seeding frag: 0x3b256f1f_0102519 Length: 2397
number of unused frag: 5 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  61
frag is used? False
seeding frag: 0x3f31dab0_0019006 Length: 2040
number of unused frag: 4 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  62
frag is used? False
seeding frag: 0xcc88cea2_0015902 Length: 1651
number of unused frag: 3 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------
iteration:  63
frag is used? False
seeding frag: 0x2708c7d8_0113402 Length: 1617
number of unused frag: 2 total overlapped frag: 2
1 terminate_2 1 terminate_2
--------------------------
iteration:  64
frag is used? False
seeding frag: 0x612f720b_0078835 Length: 1275
number of unused frag: 1 total overlapped frag: 2
1 terminate_2 1 frag_used
--------------------------

What are the singletons?

One can align these singleton contigs to the canonical E. coli K12 reference to evaluate the quality of the assembly. One can see from the alignment data below that some fragments have lower accuracy and do not pass the 98% identity threshold to be called "overlapped" with other fragments. Three of them do not have full-length alignments. For these cases, one would need to examine the raw reads corresponding to these fragments to see whether they are minor variants or lower quality reads that were not filtered out.

In [18]:
!cat singleton_tig_000* > singletons.fa
!blasr singletons.fa ecoli_k12_MG1655.fasta -m 4 -noSplitSubreads -nCandidates 20 -bestn 1 | awk '{print $1" "$2" "$4" "$6" "$7" "$8" "$8-($7-$6)}'
singleton_tig00001 Escherichia_coli_K12 97.4017 0 9562 9562 0
singleton_tig00002 Escherichia_coli_K12 97.3956 1431 8653 8653 1431
singleton_tig00003 Escherichia_coli_K12 98.432 0 8601 8601 0
singleton_tig00004 Escherichia_coli_K12 98.3834 0 8529 8529 0
singleton_tig00005 Escherichia_coli_K12 99.4192 0 8390 8390 0
singleton_tig00006 Escherichia_coli_K12 97.3447 0 8012 8012 0
singleton_tig00007 Escherichia_coli_K12 98.1237 0 7742 7742 0
singleton_tig00008 Escherichia_coli_K12 97.7821 0 7635 7635 0
singleton_tig00009 Escherichia_coli_K12 98.2443 0 7612 7612 0
singleton_tig00010 Escherichia_coli_K12 99.1766 2 7592 7592 2
singleton_tig00011 Escherichia_coli_K12 97.1927 0 7586 7586 0
singleton_tig00012 Escherichia_coli_K12 96.5121 0 7558 7568 10
singleton_tig00013 Escherichia_coli_K12 97.4269 0 7462 7462 0
singleton_tig00014 Escherichia_coli_K12 98.3228 0 7446 7446 0
singleton_tig00015 Escherichia_coli_K12 99.4492 0 7402 7402 0
singleton_tig00016 Escherichia_coli_K12 98.9734 0 7335 7335 0
singleton_tig00017 Escherichia_coli_K12 97.9114 0 7033 7033 0
singleton_tig00018 Escherichia_coli_K12 97.6773 0 6986 6986 0
singleton_tig00019 Escherichia_coli_K12 96.437 0 6961 6961 0
singleton_tig00020 Escherichia_coli_K12 97.4028 0 6941 6941 0
singleton_tig00021 Escherichia_coli_K12 97.5844 0 6913 6913 0
singleton_tig00022 Escherichia_coli_K12 99.1106 0 6687 6687 0
singleton_tig00023 Escherichia_coli_K12 98.9452 0 6663 6663 0
singleton_tig00024 Escherichia_coli_K12 97.4351 0 6502 6502 0
singleton_tig00025 Escherichia_coli_K12 97.3736 0 6444 6444 0
singleton_tig00026 Escherichia_coli_K12 96.5362 0 6439 6439 0
singleton_tig00027 Escherichia_coli_K12 97.9105 0 6326 6326 0
singleton_tig00028 Escherichia_coli_K12 97.7177 0 6255 6255 0
singleton_tig00029 Escherichia_coli_K12 98.0028 0 6235 6235 0
singleton_tig00030 Escherichia_coli_K12 98.09 0 6219 6219 0
singleton_tig00031 Escherichia_coli_K12 95.9777 0 6209 6209 0
singleton_tig00032 Escherichia_coli_K12 97.8922 0 6178 6178 0
singleton_tig00033 Escherichia_coli_K12 97.8164 0 6142 6142 0
singleton_tig00034 Escherichia_coli_K12 99.2136 0 6056 6056 0
singleton_tig00035 Escherichia_coli_K12 97.2532 0 6022 6022 0
singleton_tig00036 Escherichia_coli_K12 97.854 0 6020 6020 0
singleton_tig00037 Escherichia_coli_K12 97.1526 0 6009 6009 0
singleton_tig00038 Escherichia_coli_K12 96.5394 0 5945 5945 0
singleton_tig00039 Escherichia_coli_K12 97.2628 0 5867 5867 0
singleton_tig00040 Escherichia_coli_K12 97.3697 0 5854 5854 0
singleton_tig00041 Escherichia_coli_K12 97.9883 0 5848 5848 0
singleton_tig00042 Escherichia_coli_K12 97.7751 0 5718 5718 0
singleton_tig00043 Escherichia_coli_K12 97.9378 0 5607 5607 0
singleton_tig00044 Escherichia_coli_K12 99.2899 0 5594 5594 0
singleton_tig00045 Escherichia_coli_K12 98.4484 0 5523 5523 0
singleton_tig00046 Escherichia_coli_K12 97.7524 0 5483 5483 0
singleton_tig00047 Escherichia_coli_K12 83.9108 0 5403 5403 0
singleton_tig00048 Escherichia_coli_K12 97.5472 0 5372 5372 0
singleton_tig00049 Escherichia_coli_K12 97.8931 0 5027 5027 0
singleton_tig00050 Escherichia_coli_K12 97.9968 0 4894 4894 0
singleton_tig00051 Escherichia_coli_K12 97.739 0 4758 4758 0
singleton_tig00052 Escherichia_coli_K12 94.8541 58 4746 4746 58
singleton_tig00053 Escherichia_coli_K12 97.3543 0 4530 4530 0
singleton_tig00054 Escherichia_coli_K12 95.1304 0 4419 4419 0
singleton_tig00055 Escherichia_coli_K12 96.9268 0 4386 4386 0
singleton_tig00056 Escherichia_coli_K12 97.6502 0 4332 4332 0
singleton_tig00057 Escherichia_coli_K12 97.4755 0 4173 4173 0
singleton_tig00058 Escherichia_coli_K12 97.8816 0 3840 3840 0
singleton_tig00059 Escherichia_coli_K12 97.7303 0 2975 2975 0
singleton_tig00060 Escherichia_coli_K12 96.2158 0 2397 2397 0
singleton_tig00061 Escherichia_coli_K12 97.5586 0 2040 2040 0
singleton_tig00062 Escherichia_coli_K12 97.5768 0 1651 1651 0
singleton_tig00063 Escherichia_coli_K12 97.8221 0 1617 1617 0
singleton_tig00064 Escherichia_coli_K12 97.032 0 1275 1275 0

How good is the main contig?

The main contig is 4650011 bp. It should cover the whole genome. We can check the whole genome alignment using Gepard. We can see that there is no large-scale mis-assembly. Promising!! (The strain of K12 used for this data set is actually slightly different from the canonical one, so we do expect to see some small structural variations.)

In [19]:
Image(filename="tig_align.jpeg")
Out[19]:

Accuracy?

We don't expect high accuracy from the "draft" assembly. The way we construct the contig is simply to copy-and-paste from the pre-assembled reads, which have a mean identity of 99.787% to the reference genome. For a 4.7 Mb genome, we therefore expect about 4700000 * (1-0.99787) ~ 10k differences. One could do another round of consensus using the pre-assembled reads. Or, we can align the raw reads to the draft contig and apply the "Quiver" algorithm to get the best accuracy from all the raw data. Let's see how "Quiver" helps to remove the errors in the draft contig.
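
The same back-of-the-envelope arithmetic in code (the numbers are taken from the text above; this just restates the ~10k estimate):

mean_identity = 0.99787   # mean identity of the pre-assembled reads vs. the reference
genome_size = 4700000     # ~4.7 Mb
expected_differences = genome_size * (1 - mean_identity)
print("expected differences in the draft contig: ~%.0f" % expected_differences)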

First, let's evaluate the difference between the draft contig and the canonical reference with dnadiff from the MUMmer3 package.

In [20]:
#!dnadiff ecoli_k12_MG1655.fasta tig_00000.fa -p diff_tig0
!echo the number of SNPs is `cat diff_tig0.snps  | wc | awk '{print $1}'`
the number of SNPs is 10026

The draft contig has about 10k SNPs, as we expected from the earlier estimate. We apply the Quiver consensus algorithm to the draft contig using the following script:

#!/bin/bash
export SEYMOUR_HOME=/PathTo/PacBio/SMRTAnalysis
. $SEYMOUR_HOME/etc/setup.sh 
cd /Path/To/WorkingDirectory
cp tig_00000.fa asm.ctg.fasta
referenceUploader -c -p $PWD -n assembly -f asm.ctg.fasta --skipIndexUpdate

compareSequences.py --info --useGuidedAlign --algorithm=blasr --nproc=24 --noXML --h5mode=w --h5fn=out.cmp.h5 --minAccuracy=0.70 --minLength=200 \
-x -nCandidates 50 -x -minMatch 12 -x -bestn 1 -x -minPctIdentity 70.0 /Path/To/WorkingDirectory/input.fofn assembly/

loadPulses /Path/To/WorkingDirectory/input.fofn out.cmp.h5\
-metrics DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag -byread \

cmph5tools.py sort out.cmp.h5 --tmp /scratch

variantCaller.py --algorithm quiver -j 16 --referenceFilename assembly/sequence/assembly.fasta \
--parameters best -o output.gff  -o output.fasta -o output.fastq -q 0  -X 80 -x 5 --mapQvThreshold 0 out.cmp.h5

This generates the consensus sequence using Quiver. The consensus output is in output.fasta. We find 26 SNPs in the final consensus, which gives a lower bound on the final assembly Phred QV of about -10*log10(26/4639675) = 52.5. One can see that the Quiver algorithm, which utilizes the important information in the raw trace signals, is able to correct the ~10k errors in the draft assembly. This allows us more freedom when constructing the initial draft assembly: we do not need to construct a perfect draft from perfect reads. The goal of constructing the draft is to get the contiguity right. Given long reads, one can expect that most of the mapping is unambiguous, and that leads to an almost perfect consensus with Quiver.

In [21]:
#dnadiff ecoli_k12_MG1655.fasta output.fasta -p diff_quiver_tig0
!echo the number of SNPs is `cat diff_quiver_tig0.snps | wc | awk '{print $1}'`
from math import log
print "The phred scale QV of the assembly is about %0.1f" % (-10*log(26.0/4639675)/log(10))
print "The concordnace is about %0.5f%%" % (100*(1-26.0/4639675))
the number of SNPs is 26
The phred scale QV of the assembly is about 52.5
The concordance is about 99.99944%

Final Thought

I had some fun with this exercise. I think this approach is general, but it will need more tests for sure; we only have an n=1 success story for now. However, the reality is that as we can generate longer and longer reads, the assembly problem will become easier and easier. DNA sequencing is about connecting different bases across distance, and some problems can only be solved with long enough read spans. I would like to see more fundamental theories published about genome assembly and haplotyping (e.g. the classical Lander-Waterman paper and the recent careful analysis done (preprint) by David Tse's group). Getting a good consensus from long reads is probably a relatively simpler problem than dealing with the combinatorial explosion when one tries to infer long range information using short reads. Taking bacterial assembly as an example, it is likely impossible to assemble the E. coli K12 genome into a single contig with a single-end short read library. Without any long paired-end or jumping libraries, single-end short read data simply does not have the power to resolve the various scales of repeat structure in the genome. Such constraints cannot be overcome by simply sequencing more (see Tse's argument). Actually, regardless of how cheap DNA sequencing is, without a proper understanding of the fundamental theories and mathematical constraints, one could just waste money collecting tons of useless data. Cheap data still costs money, and cheap data that cannot solve the problem is actually expensive. One can sequence the E. coli K12 genome to 10000x coverage with short read technologies cheaply. However, without the proper libraries to get long range information, 10000x coverage from short fragments will still lead to a fragmented assembly. On the other hand, one can see from the example in this IPython Notebook that once we can generate good quality long reads, even an assembly newbie like me can put together a prototype of an assembler in Python in a few days. By the way, it only takes about 20 seconds on my 2012 MacBook Air (Intel i7) to get the draft assembly from the m4 file.

In [22]:
!time python Write_An_Assembler.py > /dev/null
real	0m21.142s
user	0m20.326s
sys	0m0.745s