In this recipe we look at how to find the longest run of a specified set of characters in a Sequence object. In this particular case we'll work with DNA sequences, and look for runs of purines and pyrimidines. In this context, I use the term run to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch of various character sets.

First, we'll configure the environment:

In [1]:
from skbio import DNA

Next, let's load a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same.

In [2]:
dna = DNA.read('data/single_sequence1.fasta', seq_num=1)
dna
Out[2]:
DNA
----------------------------------------------------------------------
Metadata:
    'description': ''
    'id': '229854'
Stats:
    length: 1541
    has gaps: False
    has degenerates: True
    has definites: True
    GC-content: 52.30%
----------------------------------------------------------------------
0    GAGTTTGATC CTGGCTCAGA TTGAACGCTG GCGGCATGCT TAACACATGC AAGTCGAACG
60   GCAGCATGAC TTAGCTTGCT AAGTTGATGG CGAGTGGCGA ACGGGTGAGT AACGCGTAGG
...
1440 CCATGGGAGT GGGCTGCACC AGAAGTAGAT AGTCTAACCG CAAGGGGGAC GTTTACCACG
1500 GTGTGGTTCA TGACTGGGGT GAAGTCGTAA CAAGGTAGCC G

To find the longest run of purines, we use DNA.find_motifs('purine-run') to find all purine runs. These come back as python slice objects, so we can find the start and stop positions of each purine run and then Python's max function to find the longest. We can also slice our sequence object directly using the resulting slices.

In [3]:
purine_runs = list(dna.find_motifs('purine-run', min_length=2))
longest_purine = max(purine_runs, key=lambda x: x.stop - x.start)
In [4]:
dna[longest_purine]
Out[4]:
DNA
--------------------------
Metadata:
    'description': ''
    'id': '229854'
Stats:
    length: 11
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 54.55%
--------------------------
0 AAAGAGGGGG A
In [5]:
longest_purine
Out[5]:
slice(130, 141, None)

To find the longest run of pyrimidines:

In [6]:
pyrimidine_runs = list(dna.find_motifs('pyrimidine-run', min_length=2))
longest_pyrimidine = max(pyrimidine_runs, key=lambda x: x.stop - x.start)
In [7]:
dna[longest_pyrimidine]
Out[7]:
DNA
--------------------------
Metadata:
    'description': ''
    'id': '229854'
Stats:
    length: 6
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 50.00%
--------------------------
0 CTCTTC
In [8]:
longest_pyrimidine
Out[8]:
slice(175, 181, None)

To find the longest run of some other character or pattern that don't have built-in support in scikit-bio's sequence objects, we can use find_with_regex. For example, to find the longest stretch of T characters:

In [9]:
t_runs = list(dna.find_with_regex('([T]+)'))
longest_t = max(t_runs, key=lambda x: x.stop - x.start)
In [10]:
dna[longest_t]
Out[10]:
DNA
--------------------------
Metadata:
    'description': ''
    'id': '229854'
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 0.00%
--------------------------
0 TTT
In [11]:
longest_t
Out[11]:
slice(3, 6, None)