Unpaired assembly challenge

You will implement software to assemble a genome from synthetic reads. We supply Python code snippets that you might use or adapt in your solutions, but you don't have to.

Part 1: Get and parse the reads

10% of the points for this question

Download the reads:

http://www.cs.jhu.edu/~langmea/resources/f2020_hw4_reads.fa

All the reads come from the same synthetic genome and each is 100 nt long. For simplicity, these reads don't have any quality values.

The following Python code will download the data to a file called reads.fa in the current directory. (Caveat: I don't think the code below works in Python 3. Sorry about that. Go here for details on how to fix: http://python-future.org/compatible_idioms.html#urllib-module.)

In [ ]:
# Download the file containing the reads to "reads.fa" in current directory
! wget http://www.cs.jhu.edu/~langmea/resources/f2020_hw4_reads.fa

# Following line is so we can see the first few lines of the reads file
# from within IPython -- don't paste this into your Python code
! head f2020_hw4_reads.fa

Or you can download them manually from your browser.

Part 2: Build an overlap graph

40% of the points for this question

Goal: Write a file containing each read's best buddy to the right. Let's define that.

For each read $A$, find the other read $B$ that has the longest suffix/prefix match with $A$, i.e. a suffix of $A$ matches a prefix of $B$. $B$ is $A$'s best buddy to the right. However, if there is a tie, or if the longest suffix/prefix match is less than 40 nucleotides long, then $A$ has no best buddy to the right. For each read, your program should output either (a) nothing, if there is no best buddy to the right, or (b) a single, space-separated line with the IDs of $A$ and $B$ and the length of the overlap, like this:

0255/2 2065/1 88

This indicates a 88 bp suffix of the read with ID 0255/2 is a prefix of the read with ID 2065/1. Because of how we defined best buddy, it also means no other read besides 2065/1 has a prefix of 88+ bp that is also a suffix of read 0255/2. A corrolary of this is that a particular read ID should appear in the first column of your program's output at most once. Also, since we require the overlap to be at least 40 bases long, no number less than 40 should every appear in the last column.

Notes:

  • You can assume all reads are error-free and from the forward strand. You do not need to consider sequencing errors or reverse complements.
  • Below is a hint that can make things speedier.
  • The order of the output lines is not important.

Hint 1: the following function groups reads such that you can avoid comparing every read to every other read when looking for suffix/prefix matches. It builds a dictionary called table where the keys are k-mers and the values are sets containing the names of all reads containing that k-mer. Since you are looking for overlaps of length at least 40, you only need to compare reads if they have at least 1 40-mer in common.

In [ ]:
def make_kmer_table(seqs, k):
    ''' Given dictionary (e.g. output of parse_fasta) and integer k,
        return a dictionary that maps each k-mer to the set of names
        of reads containing the k-mer. '''
    table = {}  # maps k-mer to set of names of reads containing k-mer
    for name, seq in seqs.items():
        for i in range(0, len(seq) - k + 1):
            kmer = seq[i:i+k]
            if kmer not in table:
                table[kmer] = set()
            table[kmer].add(name)
    return table

Hint 2: here's a function for finding suffix/prefix matches; we saw this in class:

In [ ]:
def suffixPrefixMatch(str1, str2, min_overlap):
    ''' Returns length of longest suffix of str1 that is prefix of
        str2, as long as that suffix is at least as long as min_overlap. '''
    if len(str2) < min_overlap: return 0
    str2_prefix = str2[:min_overlap]
    str1_pos = -1
    while True:
        str1_pos = str1.find(str2_prefix, str1_pos + 1)
        if str1_pos == -1: return 0
        str1_suffix = str1[str1_pos:]
        if str2.startswith(str1_suffix): return len(str1_suffix)

Part 3: Build unitigs

50% of the points for this question

Goal: Write a program that takes the output of the overlap program from part 1 and creates uniquely assemblable contigs (unitigs), using the best buddy algorithm described below.

We already determined each read's best buddy to the right. I'll abbreviate this as bbr. We did not attempt to compute each read's best buddy to the left (bbl), but we can infer it from the bbrs. Consider the following output:

A B 60
E A 40
C B 70
D C 40

$A$'s bbr is $B$. But $B$'s bbl is $C$, not $A$! Your program should form unitigs by joining together two reads $X$ and $Y$ if they are mutual best buddies. $X$ and $Y$ are mutual best buddies if $X$'s bbr is $Y$ and $Y$'s bbl is $X$, or vice versa. In this example, we would join $D$, $C$, and $B$ into a single unitig (and in that order), and would join reads $E$ and $A$ into a single unitig (also in that order).

Your program's output should consist of several entries like the following, with one entry per unitig:

START UNITIG 1 D
  C 40
  B 70
END UNITIG 1
START UNITIG 2 E
  A 40
END UNITIG 2

The first entry represents a unitig with ID 1 consisting of 3 reads. The first (leftmost) read is D. The second read, C, has a 40 nt prefix that is a suffix of the previous read (D). The third (rightmost) read in the contig (B) has a 70 bp prefix that is a suffix of the previous read (C).

Each read should be contained in exactly one unitig. The order of unitigs in the file is not important, but the unitig IDs should be integers and assigned in ascending order.

Note: we will never provide an input that can result in a circular unitig (i.e. one where a chain of mutual best buddies loops back on itself.)

Hint: the correct solution consists of exactly 4 unitigs.

OPTIONAL Part 4: Finish the assembly

This part is optional. You can submit your solution if you like. No extra credit will be awarded.

Goal: Assemble the genome! Report the sequence of the original genome as a FASTA file.

This requires that you compare the unitigs to each other, think about what order they must go in, and then put them together accordingly. Submit your solution as a single FASTA file containing a single sequence named "solution". The FASTA file should be "wrapped" so that no line has more than 60 characters. You can use the following Python code to write out your answer.

In [ ]:
import sys

def write_solution(genome, per_line=60, out=sys.stdout):
    offset = 0
    out.write('>solution\n')
    while offset < len(genome):
        nchars = min(len(genome) - offset, per_line)
        line = genome[offset:offset+nchars]
        offset += nchars
        out.write(line + '\n')

Here is an example of what your output should look like. Note how the sequence is spread over many lines.

In [ ]:
import random
random.seed(5234)
write_solution(''.join([random.choice('ACGT') for _ in range(500)]))

Hint 1: the correct genome is 7959 nucleotides long

Hint 2: to learn how the unitigs should go together, try overlapping them with each other

Hint 3: a unitig might be used more than once in the solution

In [ ]: