In this recipe we look at how to find the longest run of a specified set of characters in a Sequence
object. In this particular case we'll work with DNA
sequences, and look for runs of purines and pyrimidines. In this context, I use the term run to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch of various character sets.
First, we'll configure the environment:
from skbio import DNA
Next, let's load a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same.
dna = DNA.read('data/single_sequence1.fasta', seq_num=1)
dna
DNA ---------------------------------------------------------------------- Metadata: 'description': '' 'id': '229854' Stats: length: 1541 has gaps: False has degenerates: True has definites: True GC-content: 52.30% ---------------------------------------------------------------------- 0 GAGTTTGATC CTGGCTCAGA TTGAACGCTG GCGGCATGCT TAACACATGC AAGTCGAACG 60 GCAGCATGAC TTAGCTTGCT AAGTTGATGG CGAGTGGCGA ACGGGTGAGT AACGCGTAGG ... 1440 CCATGGGAGT GGGCTGCACC AGAAGTAGAT AGTCTAACCG CAAGGGGGAC GTTTACCACG 1500 GTGTGGTTCA TGACTGGGGT GAAGTCGTAA CAAGGTAGCC G
To find the longest run of purines, we use DNA.find_motifs('purine-run')
to find all purine runs. These come back as python slice
objects, so we can find the start and stop positions of each purine run and then Python's max
function to find the longest. We can also slice our sequence object directly using the resulting slices.
purine_runs = list(dna.find_motifs('purine-run', min_length=2))
longest_purine = max(purine_runs, key=lambda x: x.stop - x.start)
dna[longest_purine]
DNA -------------------------- Metadata: 'description': '' 'id': '229854' Stats: length: 11 has gaps: False has degenerates: False has definites: True GC-content: 54.55% -------------------------- 0 AAAGAGGGGG A
longest_purine
slice(130, 141, None)
To find the longest run of pyrimidines:
pyrimidine_runs = list(dna.find_motifs('pyrimidine-run', min_length=2))
longest_pyrimidine = max(pyrimidine_runs, key=lambda x: x.stop - x.start)
dna[longest_pyrimidine]
DNA -------------------------- Metadata: 'description': '' 'id': '229854' Stats: length: 6 has gaps: False has degenerates: False has definites: True GC-content: 50.00% -------------------------- 0 CTCTTC
longest_pyrimidine
slice(175, 181, None)
To find the longest run of some other character or pattern that don't have built-in support in scikit-bio's sequence objects, we can use find_with_regex
. For example, to find the longest stretch of T
characters:
t_runs = list(dna.find_with_regex('([T]+)'))
longest_t = max(t_runs, key=lambda x: x.stop - x.start)
dna[longest_t]
DNA -------------------------- Metadata: 'description': '' 'id': '229854' Stats: length: 3 has gaps: False has degenerates: False has definites: True GC-content: 0.00% -------------------------- 0 TTT
longest_t
slice(3, 6, None)