Personal Genomic Toolchain

Hello, my name is Dakota. This is week 4 of my work on the Personal Genomic toolchain. The purpose of this project is to develop tools to process raw genome sequence reads quickly to answer questions like, "What are the most common sequences whose calls are all above a given confidence threshold?"

Why this matters

Genetics can influence risk factors for diseases like heart disease, one of the world’s leading causes of illness and death.

Further, personal genomics is already a $10 billion industry[1] with companies such as 23andme assessing a person's genetic risk factors.

Processing this data faster can lead to increased revenue and make data driven business models cheaper. This could also lead to cheaper, more portable diaganostic tools


The central goal of this project is to process raw genome sequence reads as fast as possible with a goal of processing a typical sequence read in minutes.

Another is to reduce the resources required to perform this analysis, so that more data can be analyzed with less cost.

This project also takes confidence values for the calls into account to increase confidence in the conclusions.

The ultimate goal is to identify common structures among genomes which can inform genome sequencing.

Project Distinctions

Python's native multiprocessing module allows the code to be parallelized and fit into a MapReduce paradigm which can be advantageous for scaling to distributed platforms such as Hadoop.

The question driving this project is a local one, and taking advantage of this enables the use the MapReduce paradigm.

Further optimizations can be made based on the fact that the data is stored in a text format to which Python strings are well-suited.

These aspects combine to produce a framework which achieves the goals of the project.

Project Results

In the first week of the project, my naive code could process about 1Mb of data, corresponding to about 500 thousand calls, in about 40 seconds.

Once I implemented the MapReduce paradigm with Python Multiprocessing, the code could process about 10 Mb of data in about 30 seconds.

Further optimizations, including a heavier reliance on Python string processing and use of the Counter object produced code which can process 1 Gb of data, or 500 million calls, in ten seconds.

At this point, memory is becoming a limiting factor almost as important as time. Solving this issue will be a good step toward moving the processing off a single machine and to a distributed platform.

These improvements are all based on running different code on the same machine, using data from the 1000 genomes project which I will discuss in more detail in a moment.

This project identifies two clear behaviors in the read data based on the confidence threshhold.

Other Applications

Where can these ideas be applied in other areas?

One place is ngram intent analysis where each ngram is given a fitness score is a very close problem to this one.

This can also be used to enhance recommendation engines: given the user ratings of the shows I watched today and my history, what movie should I watch next?

Other examples include co-presence cluster identification and general prediction models.

Read Sequence

One of the challenges of this project is the sheer size of the read data.

A read is a sequence of of bases A, C, G and T coupled with confidence values.

A typical raw read from the 1000 genomes project consists of billions of base pairs and is around 60 Gb of data.

A genome is, in a sense, a fingerprint for a species. The reference genome for humans from 1000 genomes is about 3 Gb.

Individual bases are difficult to distinguish, and this affects the confidence values.

Partial words

Partial words are sequences of calls possibly containing blanks.

For this project, the calls whose confidence is below a given threshold, say 50, are converted to blanks.

The frequency of these are then tabulated using a sliding window of a given length.


At lower confidences longer words are more likely, but at higher confidence values more blanks are called, so the longer words break up and become more infrequent.

These graphs of lower and higher confidence threshholds show this behavior. The long sequences at the lower confidence become much less frequent at higher confidences and slide off this graph.

In [0]:

Next Steps

The next steps of this project is to account for read data which does not fit into the machine's memory, then to adapt the code for use on a platform such as Amazon Elastic Map Reduce.

A further step is to implement genome alignment to apply this method to genomic analysis.

Thank you

These slides, script and code for this presentation are available for download from the project website and there are links to them in the description.