In [2]:

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'theme': 'serif',
              'transition': 'zoom',
              'start_slideshow_at': 'selected',
})
from IPython.display import IFrame

dammit!¶

a simple transcriptome annotator¶

Camille Scott¶

November 18, 2015¶

DIB Lab Meeting¶

TL;DR¶

dammit is a simple de novo transcriptome annotator. It was born out of the observations that annotation is mundane and annoying, all the individual pieces of the process exist already, and the existing solutions are overly complicated or rely on crappy non-free software.

Science shouldn't suck for the sake of sucking, so dammit attempts to make this sucky part of the process suck a little less.

Motivations¶

Annotation is a major component of the sea lamprey project
Many of the pieces of dammit were already implemented
No easy to use solutions
No solutions with good licensing

What should a good annotator even look like?¶

It should be easy to install and upgrade
It should only use Free software
It should make use of standard databases
It should output in reasonable formats
It should be relatively fast

and of course,

it should try to be correct!

Without further ado...¶

In [37]:

IFrame('http://www.camillescott.org/dammit', width=800, height=400)

Out[37]:

The Obligatory Flowchart¶

The Workflow

Software Used¶

TransDecoder
BUSCO
HMMER
Infernal
LAST
crb-blast (for now)
pydoit (under the hood)

All of these are Free Software, as in freedom and beer¶

Just as important, they're all accessible¶

they're maintained, and
relatively easy to install

An exception is BUSCO, which is on a dodgy lab website and distributed as a tarball. I intent to remember this ;)

Databases¶

Pfam-A
Rfam
OrthoDB
BUSCO databases
Uniref90
User-supplied protein databases

The last one is important, and sometimes ignored.

Conditional Reciprocal Best LAST¶

Building off Richard and co's work on Conditional Reciprocal Best BLAST, I've implemented a new version with Python and LAST -- CRBL. The original lives here: https://github.com/cboursnell/crb-blast

Why??¶

BLAST is too slooooooow
Ruby is yet another dependency to have users install
With Python and scikit learn, I have freedom to toy with models (and learn stuff)

And why does speed matter?¶

deluge

And, of course, some of these databases are BIG. Doing blastx and tblastn between a reasonably sized transcriptome and Uniref90 is not an experience you want to have.

ie, practical concerns.¶

A brief intro to CRBB¶

Reciprocal Best Hits (RBH) is a standard method for ortholog detection
Transcriptomes have multiple multiple transcript isoforms, which confounds RBH
CRBB uses machine learning to get at this problem

RBH in the presence of isoforms¶

RBH

CRBB attempts to associate those isoforms with appropriate annotations by learning an appropriate e-value cutoff for different transcript lengths.

CRBB

from http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5

CRBL¶

For CRBL, instead of fitting a linear model, we train a model.

SVM
Naive bayes

One limitation is that LAST has no equivalent to tblastn. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the translated transcriptome versus database hits.

Challenges¶

Very difficult to establish an appropriate boundary!

http://edhar.genomecenter.ucdavis.edu/~camille/dammit/pom.500.fa.x.pep.fa.crbl.model.fitted.pdf

sigh

Need to play with data preparation to make the boundary stand out more. CRBB did this by first running a sliding window over the transcript lengths and finding the centroid e-value, and fitting the resulting points to their model.

dammit as a library¶

dammit is a standard Python package, which means it can used as a library as well.

In [29]:

from dammit.model import CRBL
from dammit.parsers import maf_to_df_iter
import pandas as pd

maf_df = pd.concat([group for group in maf_to_df_iter('pom.500.fa.dammit/pep.fa.x.pom.500.fa.transdecoder.pep.maf')])
maf_df.sort_values(by='q_name').head(n=5)

Out[29]:

	E	EG2	q_aln_len	q_len	q_name	q_start	q_strand	s_aln_len	s_len	s_name	s_start	s_strand	score	bitscore
1031	2.200000e-47	1.600000e-37	301	923	SPAC1002.03c_gls2_I_glucosidase	533	+	294	994	SPAC30D11.01c_1119880_1123773_1_gto2_I_protein...	612	+	405	182.042873
1029	1.700000e-186	8.100000e-163	492	547	SPAC1002.12c_SPAC1002.12c_I_succinate-semialde...	55	+	493	494	SPAC139.05_1027811_1029993_1_SPAC139.05_I_prot...	0	+	1348	598.255639
1035	8.100000e-09	9.900000e+01	122	417	SPAC1002.13c_psu1_I_cell	36	+	122	531	SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c...	140	+	113	53.162569
1041	3.900000e-06	4.500000e+04	101	417	SPAC1002.13c_psu1_I_cell	32	+	97	531	SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c...	196	+	93	44.335150
1040	2.900000e-06	3.300000e+04	155	417	SPAC1002.13c_psu1_I_cell	32	+	155	386	SPAC1F8.06_99871_101431_1_fta5_I_protein_codin...	22	+	94	44.776521

In [30]:

CRBL.best_hits(maf_df)
maf_df.head(n=4)

Out[30]:

	E	EG2	q_aln_len	q_len	q_name	q_start	q_strand	s_aln_len	s_len	s_name	s_start	s_strand	score	bitscore
1031	2.200000e-47	1.600000e-37	301	923	SPAC1002.03c_gls2_I_glucosidase	533	+	294	994	SPAC30D11.01c_1119880_1123773_1_gto2_I_protein...	612	+	405	182.042873
1029	1.700000e-186	8.100000e-163	492	547	SPAC1002.12c_SPAC1002.12c_I_succinate-semialde...	55	+	493	494	SPAC139.05_1027811_1029993_1_SPAC139.05_I_prot...	0	+	1348	598.255639
1032	1.900000e-12	2.600000e-02	124	417	SPAC1002.13c_psu1_I_cell	35	+	131	531	SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c...	120	+	140	65.079583
1030	4.800000e-137	1.000000e-118	468	499	SPAC1002.16c_SPAC1002.16c_I_plasma	8	+	469	499	SPAC11D3.18c_144819_146935_-1_SPAC11D3.18c_I_p...	9	+	1016	451.720498

The code is even relatively documented...¶

In [36]:

print maf_to_df_iter.__doc__

Iterator yielding DataFrames of length chunksize holding MAF alignments.

    An extra column is added for bitscore, using the equation described here:
        http://last.cbrc.jp/doc/last-evalues.html

    Args:
        fn (str): Path to the MAF alignment file.
        chunksize (int): Alignments to parse per iteration.
    Yields:
        DataFrame: Pandas DataFrame with the alignments.

The the documentation for more: http://www.camillescott.org/dammit/py-modindex.html

Future Work¶

Shoring up the CRBL implementation
Finishing output formatting -- still need to do the FASTA output
Some slight rearrangment of the command line interface

Acknowledgements¶

Thanks to Titus for advice, input, and PRs, Chris Hamm and Matt MacManes for filing issues, Michael for advice, Richard for his nifty method, and the terrible state of the bioinformatics industry for inspiring me.