Pretty-printing DNA and protein sequences with monoseq

monoseq is a Python library for pretty-printing DNA and protein sequences using a monospace font. It also provides a simple command line interface.

Sequences are pretty-printed in the traditional way: blocks of letters, with each line prefixed by the position of its first letter. User-specified regions can be highlighted, and the output format can be HTML or plaintext, optionally styled with ANSI escape codes for use in a terminal.

Here we show how monoseq can be used in the IPython Notebook environment. See the monoseq documentation for more.

Note: Some applications (e.g., GitHub) will not show the annotation styling in this notebook. View this notebook on nbviewer to see all styling.

Use in the IPython Notebook

If you haven't already done so, install monoseq using pip.

pip install monoseq

The monoseq.ipynb module provides Seq, a convenience wrapper around monoseq.pprint_sequence for easy printing of sequence strings in an IPython Notebook.

In [1]:
from monoseq.ipynb import Seq

s = ('cgcactcaaaacaaaggaagaccgtcctcgactgcagaggaagcaggaagctgtc'
     'ggcccagctctgagcccagctgctggagccccgagcagcggcatggagtccgtgg'
     'ccctgtacagctttcaggctacagagagcgacgagctggccttcaacaagggaga'
     'cacactcaagatcctgaacatggaggatgaccagaactggtacaaggccgagctc'
     'cggggtgtcgagggatttattcccaagaactacatccgcgtcaag')

Seq(s)
Out[1]:
  1  cgcactcaaa acaaaggaag accgtcctcg actgcagagg aagcaggaag ctgtcggccc
 61  agctctgagc ccagctgctg gagccccgag cagcggcatg gagtccgtgg ccctgtacag
121  ctttcaggct acagagagcg acgagctggc cttcaacaag ggagacacac tcaagatcct
181  gaacatggag gatgaccaga actggtacaa ggccgagctc cggggtgtcg agggatttat
241  tcccaagaac tacatccgcg tcaag

Block and line lengths

We can change the number of characters per block and the number of blocks per line.

In [2]:
Seq(s, block_length=8, blocks_per_line=8)
Out[2]:
  1  cgcactca aaacaaag gaagaccg tcctcgac tgcagagg aagcagga agctgtcg gcccagct
 65  ctgagccc agctgctg gagccccg agcagcgg catggagt ccgtggcc ctgtacag ctttcagg
129  ctacagag agcgacga gctggcct tcaacaag ggagacac actcaaga tcctgaac atggagga
193  tgaccaga actggtac aaggccga gctccggg gtgtcgag ggatttat tcccaaga actacatc
257  cgcgtcaa g
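
To make the roles of the two parameters concrete, here is a minimal sketch in plain Python that produces the same kind of layout. It is not monoseq's implementation: the function name layout is hypothetical, and the default values are only assumptions chosen to match Out[1].

```python
def layout(sequence, block_length=10, blocks_per_line=6):
    """Return position-prefixed lines of space-separated blocks.

    Hypothetical helper for illustration only; not part of monoseq.
    """
    line_length = block_length * blocks_per_line
    # The margin is as wide as the largest position that can occur.
    margin = len(str(len(sequence)))
    lines = []
    for start in range(0, len(sequence), line_length):
        chunk = sequence[start:start + line_length]
        blocks = [chunk[i:i + block_length]
                  for i in range(0, len(chunk), block_length)]
        # Positions in the margin are one-based.
        lines.append(str(start + 1).rjust(margin) + '  ' + ' '.join(blocks))
    return lines


for line in layout('acgt' * 5, block_length=4, blocks_per_line=2):
    print(line)
# prints:
#  1  acgt acgt
#  9  acgt acgt
# 17  acgt
```

Each line holds block_length * blocks_per_line letters, which is why the second line starts at position 61 with the defaults and at position 65 with 8 blocks of 8.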

Annotations

Let's say we want to highlight two subsequences because they are conserved between species. We define each region as a (start, stop) tuple (zero-based, stop not included) and collect the regions in a list. The annotations argument takes a list of such region lists, one per annotation level.

In [3]:
conserved = [(11, 37), (222, 247)]

Seq(s, annotations=[conserved])
Out[3]:
  1  cgcactcaaa acaaaggaag accgtcctcg actgcagagg aagcaggaag ctgtcggccc
 61  agctctgagc ccagctgctg gagccccgag cagcggcatg gagtccgtgg ccctgtacag
121  ctttcaggct acagagagcg acgagctggc cttcaacaag ggagacacac tcaagatcct
181  gaacatggag gatgaccaga actggtacaa ggccgagctc cggggtgtcg agggatttat
241  tcccaagaac tacatccgcg tcaag
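
Because regions are zero-based with the stop excluded, they follow the same convention as Python slices, whereas the positions in the margin are one-based. The following sketch shows the conversion; the helper name region is hypothetical and not part of monoseq.

```python
def region(first, last):
    """Convert one-based, inclusive positions to a zero-based (start, stop) tuple.

    Hypothetical helper for illustration only; not part of monoseq.
    """
    if first < 1 or last < first:
        raise ValueError('expected 1 <= first <= last')
    return (first - 1, last)


sequence = 'cgcactcaaaacaaaggaag'  # the first 20 letters of s
start, stop = region(12, 15)
print((start, stop))         # (11, 15)
print(sequence[start:stop])  # caaa
```

A region tuple can therefore be checked directly by slicing the sequence string with it.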

As a contrived example to show several levels of annotation, let's also annotate every 12th character and the middle third of the sequence.

In [4]:
twelves = [(p, p + 1) for p in range(11, len(s), 12)]
middle = [(len(s) // 3, len(s) // 3 * 2)]  # integer division keeps the positions whole numbers

Seq(s, annotations=[conserved, twelves, middle])
Out[4]:
  1  cgcactcaaa acaaaggaag accgtcctcg actgcagagg aagcaggaag ctgtcggccc
 61  agctctgagc ccagctgctg gagccccgag cagcggcatg gagtccgtgg ccctgtacag
121  ctttcaggct acagagagcg acgagctggc cttcaacaag ggagacacac tcaagatcct
181  gaacatggag gatgaccaga actggtacaa ggccgagctc cggggtgtcg agggatttat
241  tcccaagaac tacatccgcg tcaag
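
Region lists do not have to be typed by hand. As a further sketch, the hypothetical helper below (not part of monoseq) builds one by locating every occurrence of a motif, returning tuples in the same zero-based, stop-excluded form.

```python
def find_regions(sequence, motif):
    """Return a (start, stop) tuple for every occurrence of motif in sequence.

    Hypothetical helper for illustration only; overlapping matches are included.
    """
    if not motif:
        raise ValueError('motif must not be empty')
    regions = []
    start = sequence.find(motif)
    while start != -1:
        regions.append((start, start + len(motif)))
        # Resume one position later so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return regions


print(find_regions('cgcactcaaaacaaaggaag', 'gaag'))  # [(16, 20)]
```

The resulting list could then be passed as one more annotation level, for example annotations=[conserved, find_regions(s, 'gaag')].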

Custom styling

The default CSS can be overridden with the style argument.

In [5]:
style = """
{selector} {{ background: beige; color: gray }}
{selector} .monoseq-margin {{ font-style: italic; color: green }}
{selector} .monoseq-annotation-0 {{ color: blue; font-weight: bold }}
"""

Seq(s, style=style, annotations=[conserved])
Out[5]:
  1  cgcactcaaa acaaaggaag accgtcctcg actgcagagg aagcaggaag ctgtcggccc
 61  agctctgagc ccagctgctg gagccccgag cagcggcatg gagtccgtgg ccctgtacag
121  ctttcaggct acagagagcg acgagctggc cttcaacaag ggagacacac tcaagatcct
181  gaacatggag gatgaccaga actggtacaa ggccgagctc cggggtgtcg agggatttat
241  tcccaagaac tacatccgcg tcaag

See the string in monoseq.ipynb.DEFAULT_STYLE for a longer example.
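
The doubled braces in the style string suggest that it is a template filled in with Python's str.format, with {selector} replaced by a CSS selector for the output element. That is an assumption based on the syntax, not something confirmed here, and the selector value in this sketch is made up.

```python
# Assumption: the style string is expanded with str.format.
# The selector '#example-output' is hypothetical.
template = """
{selector} {{ background: beige; color: gray }}
{selector} .monoseq-margin {{ font-style: italic; color: green }}
"""

css = template.format(selector='#example-output')
print(css)
# prints (surrounded by blank lines):
# #example-output { background: beige; color: gray }
# #example-output .monoseq-margin { font-style: italic; color: green }
```

Under that assumption, a single brace in a custom style would be read as a placeholder, so literal CSS braces must always be written as {{ and }}.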