from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
'theme': 'serif',
'transition': 'zoom',
'start_slideshow_at': 'selected',
})
from IPython.display import IFrame
dammit is a simple de novo transcriptome annotator. It was born out of the observations that annotation is mundane and annoying, all the individual pieces of the process exist already, and the existing solutions are overly complicated or rely on crappy non-free software.
Science shouldn't suck for the sake of sucking, so dammit attempts to make this sucky part of the process suck a little less.
and of course,
IFrame('http://www.camillescott.org/dammit', width=800, height=400)
An exception is BUSCO, which is on a dodgy lab website and distributed as a tarball. I intent to remember this ;)
The last one is important, and sometimes ignored.
Building off Richard and co's work on Conditional Reciprocal Best BLAST, I've implemented a new version with Python and LAST -- CRBL. The original lives here: https://github.com/cboursnell/crb-blast
And, of course, some of these databases are BIG. Doing blastx
and tblastn
between a reasonably sized transcriptome and Uniref90 is not an experience you want to have.
CRBB attempts to associate those isoforms with appropriate annotations by learning an appropriate e-value cutoff for different transcript lengths.
from http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5
For CRBL, instead of fitting a linear model, we train a model.
One limitation is that LAST has no equivalent to tblastn
. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the translated transcriptome versus database hits.
Very difficult to establish an appropriate boundary!
http://edhar.genomecenter.ucdavis.edu/~camille/dammit/pom.500.fa.x.pep.fa.crbl.model.fitted.pdf
Need to play with data preparation to make the boundary stand out more. CRBB did this by first running a sliding window over the transcript lengths and finding the centroid e-value, and fitting the resulting points to their model.
dammit is a standard Python package, which means it can used as a library as well.
from dammit.model import CRBL
from dammit.parsers import maf_to_df_iter
import pandas as pd
maf_df = pd.concat([group for group in maf_to_df_iter('pom.500.fa.dammit/pep.fa.x.pom.500.fa.transdecoder.pep.maf')])
maf_df.sort_values(by='q_name').head(n=5)
E | EG2 | q_aln_len | q_len | q_name | q_start | q_strand | s_aln_len | s_len | s_name | s_start | s_strand | score | bitscore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1031 | 2.200000e-47 | 1.600000e-37 | 301 | 923 | SPAC1002.03c_gls2_I_glucosidase | 533 | + | 294 | 994 | SPAC30D11.01c_1119880_1123773_1_gto2_I_protein... | 612 | + | 405 | 182.042873 |
1029 | 1.700000e-186 | 8.100000e-163 | 492 | 547 | SPAC1002.12c_SPAC1002.12c_I_succinate-semialde... | 55 | + | 493 | 494 | SPAC139.05_1027811_1029993_1_SPAC139.05_I_prot... | 0 | + | 1348 | 598.255639 |
1035 | 8.100000e-09 | 9.900000e+01 | 122 | 417 | SPAC1002.13c_psu1_I_cell | 36 | + | 122 | 531 | SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c... | 140 | + | 113 | 53.162569 |
1041 | 3.900000e-06 | 4.500000e+04 | 101 | 417 | SPAC1002.13c_psu1_I_cell | 32 | + | 97 | 531 | SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c... | 196 | + | 93 | 44.335150 |
1040 | 2.900000e-06 | 3.300000e+04 | 155 | 417 | SPAC1002.13c_psu1_I_cell | 32 | + | 155 | 386 | SPAC1F8.06_99871_101431_1_fta5_I_protein_codin... | 22 | + | 94 | 44.776521 |
CRBL.best_hits(maf_df)
maf_df.head(n=4)
E | EG2 | q_aln_len | q_len | q_name | q_start | q_strand | s_aln_len | s_len | s_name | s_start | s_strand | score | bitscore | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1031 | 2.200000e-47 | 1.600000e-37 | 301 | 923 | SPAC1002.03c_gls2_I_glucosidase | 533 | + | 294 | 994 | SPAC30D11.01c_1119880_1123773_1_gto2_I_protein... | 612 | + | 405 | 182.042873 |
1029 | 1.700000e-186 | 8.100000e-163 | 492 | 547 | SPAC1002.12c_SPAC1002.12c_I_succinate-semialde... | 55 | + | 493 | 494 | SPAC139.05_1027811_1029993_1_SPAC139.05_I_prot... | 0 | + | 1348 | 598.255639 |
1032 | 1.900000e-12 | 2.600000e-02 | 124 | 417 | SPAC1002.13c_psu1_I_cell | 35 | + | 131 | 531 | SPAC13G6.10c_191928_194053_-1_asl1_I_protein_c... | 120 | + | 140 | 65.079583 |
1030 | 4.800000e-137 | 1.000000e-118 | 468 | 499 | SPAC1002.16c_SPAC1002.16c_I_plasma | 8 | + | 469 | 499 | SPAC11D3.18c_144819_146935_-1_SPAC11D3.18c_I_p... | 9 | + | 1016 | 451.720498 |
print maf_to_df_iter.__doc__
Iterator yielding DataFrames of length chunksize holding MAF alignments. An extra column is added for bitscore, using the equation described here: http://last.cbrc.jp/doc/last-evalues.html Args: fn (str): Path to the MAF alignment file. chunksize (int): Alignments to parse per iteration. Yields: DataFrame: Pandas DataFrame with the alignments.
The the documentation for more: http://www.camillescott.org/dammit/py-modindex.html
Thanks to Titus for advice, input, and PRs, Chris Hamm and Matt MacManes for filing issues, Michael for advice, Richard for his nifty method, and the terrible state of the bioinformatics industry for inspiring me.