Modules and I/O


Bioinformatics functions

We've reached a point where you could now tell the computer to do just about anything you'd like with biological data: you have a good grasp of quantitative methods, you know enough Python to write nearly any program, and you have several different tools available to get data into and out of a computing environment. However, just like a spoken language, what you don't have yet is practice - it's impossible to get familiar with moving bioinformatic data around without doing it. This activity, and the next few problem sets, are designed to help provide that practice while introducing a few additional tools to make computational biology even easier and more convenient.

The first few sections here will thus provide some short examples of how functions to perform specific biological tasks might be developed in Python. Then we'll discuss some pre-existing modules that might be useful (did I mention that "pre-existing" means someone else already did the work?), and wrap up with examples of data input and output using the single most common biological data format, a FASTA file for nucleotide (and amino acid) sequences.

Locating binding sites

Suppose you're studying a particular transcription factor binding site of interest, and you'd like to determine where it occurs in a set of upstream promoter sequences. We'll keep things simple and suppose that the TFBS is a single, non-degenerate string of nucleotides (an assumption that we'll learn how to work around soon), and that you've already downloaded your promoter sequences from someplace like BioMart (which we'll discuss in detail in a few weeks). The first step in your investigation of what this TFBS does is identifying in which genes' promoters it occurs.

For the sake of argument, let's say your TFBS sequence is TGACTCA (bonus points if you can identify what TF this corresponds to, there are at least two correct answers), and your promoter DNA sequences (each arising from a particular gene) are:

GGAGTGGTTG ATGACTCATA CTCCATTCGG TTTTTTCGTG CATTGACTCA

The "correct answer" would be to identify that the second and fifth genes' promoters include your TFBS:

GGAGTGGTTG ATGACTCATA CTCCATTCGG TTTTTTCGTG CATTGACTCA

It's really hard to see this type of information using human eyes (even after it's been highlighted!), but it's a great candidate for a task that a computer would be happy to help you with. Python, like any other computational tool, exists to help you tell the computer what tasks you'd like it to help you carry out for your research.

The first step in turning this idea into Python is to define the data you're working with:

  • One input must be a string representing the one or more nucleotides of the binding site motif.

  • One input must be a list of strings representing the nucleotides of each gene's promoter sequences.

  • For the sake of this example, we'll create as an output a list of booleans, of the same length as the input list, indicating True for each promoter that contains the motif and False otherwise.

Thus our anticipated output for the example above is:

In [1]:
[False, True, False, False, True]
Out[1]:
[False, True, False, False, True]

Let's write a function that accepts these inputs and produces the desired output for us. Using our knowledge of Python syntax, we know the appropriate pseudocode (something that has the right semantics but not the right syntax) is:

def bindingSites( strTFBS, astrPromoters ):

    for each promoter string:
        if it contains the TFBS:
            remember True
        else:
            remember False
    return a list of everything we've remembered

Well, you know how to examine each promoter string:

In [2]:
def bindingSites( strTFBS, astrPromoters ):
    
    for strPromoter in astrPromoters:
        pass

And you know how to return a list of results (particularly a list of boolean flag values) after each promoter has been processed:

In [3]:
def bindingSites( strTFBS, astrPromoters ):
    
    for strPromoter in astrPromoters:
        pass
    
    return afResults

How do we fill up that list of results? Well, the key test is whether each particular promoter contains the TFBS, which is well-satisfied by the Python string find function. The find function targets a string, accepts one additional string input argument representing the substring to locate, and returns the integer index of that substring if it's found (starting with 0 as usual) or -1 if it's not located:

In [4]:
"test".find( "es" )
Out[4]:
1
In [5]:
"taste".find( "es" )
Out[5]:
-1
In [6]:
"testy".find( "t" )
Out[6]:
0
In [7]:
"testy".find( "ty")
Out[7]:
3
In [8]:
"This is a test of the emergency broadborking system".find( "bork" )
Out[8]:
37

Thus if a string contains a substring, the find method returns a value of at least 0. Let's use that to continue expanding our pseudocode:

In [9]:
def bindingSites( strTFBS, astrPromoters ):
    
    for strPromoter in astrPromoters:
        if strPromoter.find( strTFBS ) >= 0:
            afResults.append( True )
        else:
            afResults.append( False )
    
    return afResults

This looks like it might be close, but if we try it out on a very simple test case, we have a problem:

In [10]:
bindingSites( "T", ["A", "C", "T", "G"] )
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-2f3390af411d> in <module>()
----> 1 bindingSites( "T", ["A", "C", "T", "G"] )

<ipython-input-9-cc9d235805c6> in bindingSites(strTFBS, astrPromoters)
      5             afResults.append( True )
      6         else:
----> 7             afResults.append( False )
      8 
      9     return afResults

NameError: global name 'afResults' is not defined

In general, when reading Python errors, you should pay attention to two pieces of the garbledegook: what it said went wrong, at the very bottom, and where it said it went wrong (what line of code). Here, Python's telling us that a variable named afResults wasn't defined when we tried to use it on line 7. It might be hard to tell exactly which promoter we were processing at the time, so we can always add print statements to help see what's in Python's head while it runs our code:

In [11]:
def bindingSites( strTFBS, astrPromoters ):
    
    for strPromoter in astrPromoters:
        print( "I'm at location 0, and strPromoter is: " + strPromoter )
        if strPromoter.find( strTFBS ) >= 0:
            afResults.append( True )
        else:
            print( "I'm at location 1, and strPromoter is: " + strPromoter )
            afResults.append( False )
    
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-723aff200950> in <module>()
     11     return afResults
     12 
---> 13 bindingSites( "T", ["A", "C", "T", "G"] )

<ipython-input-11-723aff200950> in bindingSites(strTFBS, astrPromoters)
      7         else:
      8             print( "I'm at location 1, and strPromoter is: " + strPromoter )
----> 9             afResults.append( False )
     10 
     11     return afResults

NameError: global name 'afResults' is not defined
I'm at location 0, and strPromoter is: A
I'm at location 1, and strPromoter is: A

So it's crashing on the very first promoter sequence, "A". If you look closely, though, the reason why has nothing to do with the promoter sequence itself - we've forgotten to tell Python what the initial value of afResults should be! After all, we know we'd like it to end up as a list containing boolean values, and we've included instructions to append new values into a pre-existing list, but we never gave Python an empty list to fill. Let's fix that and see what happens:

In [12]:
def bindingSites( strTFBS, astrPromoters ):
    
    afResults = []
    for strPromoter in astrPromoters:
        print( "I'm at location 0, and strPromoter is: " + strPromoter )
        if strPromoter.find( strTFBS ) >= 0:
            afResults.append( True )
        else:
            print( "I'm at location 1, and strPromoter is: " + strPromoter )
            afResults.append( False )
    
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
I'm at location 0, and strPromoter is: A
I'm at location 1, and strPromoter is: A
I'm at location 0, and strPromoter is: C
I'm at location 1, and strPromoter is: C
I'm at location 0, and strPromoter is: T
I'm at location 0, and strPromoter is: G
I'm at location 1, and strPromoter is: G
Out[12]:
[False, False, True, False]

Not only are our results correct now, the remaining print statements let you see exactly where Python's program counter is during (almost) every step of the process and what data it was modifying at each point. We can take those prints out now that the function appears to be working:

In [13]:
def bindingSites( strTFBS, astrPromoters ):
    
    afResults = []
    for strPromoter in astrPromoters:
        if strPromoter.find( strTFBS ) >= 0:
            afResults.append( True )
        else:
            afResults.append( False )
    
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
Out[13]:
[False, False, True, False]

And we can even simplify the function a bit by repacing our if/else combination with a single append of an equivalent value:

In [14]:
def bindingSites( strTFBS, astrPromoters ):
    
    afResults = []
    for strPromoter in astrPromoters:
        fFound = ( strPromoter.find( strTFBS ) >= 0 )
        afResults.append( fFound )
    
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
Out[14]:
[False, False, True, False]

Or we can get even shorter by cutting out the middleman:

In [15]:
def bindingSites( strTFBS, astrPromoters ):
    
    afResults = []
    for strPromoter in astrPromoters:
        afResults.append( strPromoter.find( strTFBS ) >= 0 )
    
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
Out[15]:
[False, False, True, False]

If you don't want to see Python syntax we haven't covered yet, close your eyes, but if you're curious you can even use a list comprehension to make this even shorter:

In [16]:
def bindingSites( strTFBS, astrPromoters ):
    
    afResults = [( s.find( strTFBS ) >= 0 ) for s in astrPromoters]
    return afResults

bindingSites( "T", ["A", "C", "T", "G"] )
Out[16]:
[False, False, True, False]

And you can again remove the temporary variable to define this function in just two lines:

In [17]:
def bindingSites( strTFBS, astrPromoters ):
    
    return [( s.find( strTFBS ) >= 0 ) for s in astrPromoters]

bindingSites( "T", ["A", "C", "T", "G"] )
Out[17]:
[False, False, True, False]

Now would be a good time to try it out on some more complicated test cases, like our original example:

In [18]:
strTFBS = "TGACTCA"
astrPromoters = ["GGAGTGGTTG", "ATGACTCATA", "CTCCATTCGG", "TTTTTTCGTG", "CATTGACTCA"]

bindingSites( strTFBS, astrPromoters )
Out[18]:
[False, True, False, False, True]

Or some randomly generated sequences, taking advantage of the random module, one of dozens in the Python standard library of built-in modules.

In [19]:
import random

def randomNucleotides( iLength ):
    
    astrNucleotides = []
    for i in xrange( iLength ):
        astrNucleotides.append( "ACGT"[random.randrange( 4 )] )

# The join method targets a string, accepts a list of strings as input,
# and returns a new string created by concatenating all strings in the list
# separated by the target, e.g.
# "-".join( ["a", "b", "c"] )   => "a-b-c"
# ", ".join( ["one", "two"] )   => "one, two"
# "".join( ["sm", "oo", "sh"] ) => "smoosh"
    return "".join( astrNucleotides )

strTFBS = randomNucleotides( 4 )
strTFBS
Out[19]:
'TTCA'
In [20]:
astrPromoters = []
for i in xrange( 100 ):
    astrPromoters.append( randomNucleotides( 50 ) )

# We'll just show the first 10 for brevity's sake
astrPromoters[:10]
Out[20]:
['GAAAATTAAGGAGGCGATTCGTCGCTCATTACTTAAAATTACGAATGGAC',
 'ATTAGTCAGTATCAAAGGTAAGAGCCCGAGTAAACCTTGGCATTTTCCTA',
 'GCTTCCGTCTTTCGACAATCGTAAAGGCCCATCTTAGAGGTTGAGTGAGA',
 'CCTGAATGGCAATTTTTACTGAGCTAGACCCATTCATCAAGCGCTTAATT',
 'CGCCCGACCATCGACGCTTCTGTGGGCCCGTCAACACCGAGTTCAGGGAG',
 'TTGCATGTGTCGTCCTGGAGGCATATAATTTGTGCCGTCATAAGTGAGGG',
 'TCGGGATTACGAGCAACTATCTGACCCCCCTGTGGCGAGATCGGTCCCGG',
 'CGACTGGAGTCTTTGAGCTGCAGCGATCCAATACGAGTTAGAGGTAGGCA',
 'TGCCAGCCCATGTCGACCTCATTAGAGAGTTACGGGTCTATCTGACGTTT',
 'ACGTCTGGGCTATGGACTTGTCGTCTCATCAGGATATTCGACCCGCTCCG']
In [21]:
print( bindingSites( strTFBS, astrPromoters ) )
[False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, True, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, True, False, False, True, False, True, True, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, True, False, False, False, True, False, False]

Depending on exactly how your random data turned out, you should have a nice mix of True and False hits there. Great job - now it would be time to get to work figuring out what those promoters' genes were! Fortunately, we can move on to other bioinformatics problems instead.

Counting codon usage

For our next trick, suppose we're still processing DNA sequences, but we now have an open reading frame (ORF) or coding sequence instead of a noncoding upstream promoter. That is, we have a (spliced) gene sequence that we know encodes a protein, such that every three nucleotides represent one codon. For example, the first several dozen nucleotides of the first isoform of BRCA1 are:

ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAAT GCTATGCAGAAAATCTTAGAGTGTCCCATCTGTCTGGAGTTGATCAAG GAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATG CTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGT AAGAATGATATAACCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGT CAACTTGTTGAAGAGCTATTGAAAATCATTTGTGCTTTTCAGCTTGAC ACAGGTTTGGAGTATGCAAACAGCTATAATTTTGCAAAAAAGGAAAAT

High school biochemistry tells us that the first ATG is a methionine, the next GAT is aspartic acid, TTA is leucine, and so forth. And especially for amino acids like leucine that can be coded for by many different possibly codons (TTA, CTT, CTG, etc. etc.), exactly which codons are used in a particular ORF can tell us something about its transcription rate, function, or even charge or hydrophobicity (if there are biases for particular amino acids as well as particular codons). What would it take for us to count how many of each codon occurs in an ORF?

That sounds like a function to me, with some fairly simple parameters:

  • One input string, containing the nucleotide sequence of the ORF of interest.

  • One output dictionary pairing each codon string in the ORF with an integer count of how many times it occurs.

Thus in a very simple example, if our input was:

In [22]:
"ATGGATTTAATG"
Out[22]:
'ATGGATTTAATG'

We'd like our output to be:

In [23]:
{"ATG" : 2, "GAT" : 1, "TTA" : 1}
Out[23]:
{'ATG': 2, 'GAT': 1, 'TTA': 1}

Note that we're making all sorts of assumptions about our input string - it contains only valid nucleotide letters, it's all upper case, its length is a multiple of three - but for the sake of simplicity, that's ok! Just don't forget about when those assumptions might fail in the real world.

That being said, let's try describing this task in Python so the computer can help us carry it out. We know what our function skeleton should look like:

In [24]:
def codonCount( strORF ):
    
    return hashCounts

And we have a pretty good idea of the pseudocode:

def codonCount( strORF ):

    for each codon in the ORF:
        increment the number of times we've seen it

    return the dictionary of all counts

Let's start by figuring out how to find "each codon in the ORF." A codon is a substring of length exactly three, which we can pluck out of a string using the slice operator, which has the syntax string[optional beginning:optional end]:

In [25]:
# Slice the full string from beginning (index 0) to end (index 5)
"abcde"[0:5]
Out[25]:
'abcde'
In [26]:
# Slice off only the first three characters
"abcde"[0:3]
Out[26]:
'abc'
In [27]:
# Slice off the middle two characters
"abcde"[1:3]
Out[27]:
'bc'
In [28]:
# A missing beginning index is assumed to be zero
"abcde"[:3]
Out[28]:
'abc'
In [29]:
# A missing end index is assumed to be the length of the string
"abcde"[1:]
Out[29]:
'bcde'
In [30]:
# And by the way, you can slice lists as well as strings
[1, 2, 3, 4, 5][1:3]
Out[30]:
[2, 3]

Using slice notation, a the first codon in an ORF is just the first three characters:

In [31]:
"ATGGATTTAATG"[0:3]
Out[31]:
'ATG'

The next codon is the following three characters:

In [32]:
"ATGGATTTAATG"[3:6]
Out[32]:
'GAT'

And in general, a codon at any position i is the slice from i to three characters later:

In [33]:
i = 6
"ATGGATTTAATG"[i:( i + 3 )]
Out[33]:
'TTA'

We can combine this with the fact that the full range function takes not just one (its usual) but three integer arguments, a beginning, an end, and a step:

In [34]:
# Default is just an end
range( 5 )
Out[34]:
[0, 1, 2, 3, 4]
In [35]:
# Two arguments mean a beginning and an end
range( 1, 5 )
Out[35]:
[1, 2, 3, 4]
In [36]:
# Three arguments mean a beginning, end, and step, the last of which defaults to one
range( 1, 5, 2 )
Out[36]:
[1, 3]
In [37]:
# 10 to 89 in multiples of 13
range( 10, 90, 13 )
Out[37]:
[10, 23, 36, 49, 62, 75, 88]
In [38]:
# 100 to 1000 in multiples of 111
range( 100, 1001, 111 )
Out[38]:
[100, 211, 322, 433, 544, 655, 766, 877, 988]

Now we have a way to examine every codon in an ORF string:

In [39]:
def codonCount( strORF ):
    
# Start our counts as an empty dictionary to avoid our previous mistake!
    hashCounts = {}
    for i in xrange( 0, len( strORF ), 3 ):
        strCodon = strORF[i:( i + 3 )]
        print( strCodon )
    
    return hashCounts

Although this function isn't doing what we want just yet, we can sneak in a print statement to show that it's finding every codon successfully:

In [40]:
codonCount( "ATGGATTTAATG" )
ATG
GAT
TTA
ATG
Out[40]:
{}

The next part of the pseudocode we need to translate is "increment the number of times we've seen it":

def codonCount( strORF ):

    for each codon in the ORF:
        increment the number of times we've seen it

    return the dictionary of all counts

We know that eventually, hashCounts[strCodon] will be "the number of times we've seen it." So how about:

In [41]:
def codonCount( strORF ):
    
    hashCounts = {}
    for i in xrange( 0, len( strORF ), 3 ):
        strCodon = strORF[i:( i + 3 )]
        hashCounts[strCodon] = hashCounts[strCodon] + 1
    
    return hashCounts

codonCount( "ATGGATTTAATG" )
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-41-d1e08cff6cb6> in <module>()
      8     return hashCounts
      9 
---> 10 codonCount( "ATGGATTTAATG" )

<ipython-input-41-d1e08cff6cb6> in codonCount(strORF)
      4     for i in xrange( 0, len( strORF ), 3 ):
      5         strCodon = strORF[i:( i + 3 )]
----> 6         hashCounts[strCodon] = hashCounts[strCodon] + 1
      7 
      8     return hashCounts

KeyError: 'ATG'

That's pretty inscrutable - almost all Python errors are at first - but the combination of what it said and where is still correct. On the line in which we tried to increment the codon count, the first codon (ATG) was not a valid key to use with the hashCounts dictionary. That's because we can't retrieve the value of a key that's never been stored:

In [42]:
# This is OK since it's in the dictionary
{"example" : 1, "hash" : 2}["example"]
Out[42]:
1
In [43]:
# This is not!
{"example" : 1, "hash" : 2}["barf"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-43-b4c1fe57ecb3> in <module>()
      1 # This is not!
----> 2 {"example" : 1, "hash" : 2}["barf"]

KeyError: 'barf'

Using square brackets to retrieve a value from a dictionary only works if the requested key is present, although of course we can add a new key at any time:

In [44]:
hashExample = {"example" : 1, "hash" : 2}
hashExample["barf"] = "barfs no more"
hashExample
Out[44]:
{'barf': 'barfs no more', 'example': 1, 'hash': 2}
In [45]:
hashExample["barf"]
Out[45]:
'barfs no more'

Python dictionaries support an additional, less gastrointestinally sensitive value retrieval function as well, however, the get method. get targets a dictionary, accepts one additional input of any type (the key), and returns the key's value if present and None otherwise (without generating an error):

In [46]:
{"example" : 1, "hash" : 2}.get( "example" )
Out[46]:
1
In [47]:
print( {"example" : 1, "hash" : 2}.get( "doesn't barf!" ) )
None

Thus if the result of geting a key is None, it's not in the dictionary; if it's any other value, then it has already been stored. This lets us expand our pseudocode significantly:

def codonCount( strORF ):

    for each codon in the ORF:
        if we haven't yet seen the codon:
            set its count to 1
        else:
            increment the number of times we've seen it

    return the dictionary of all counts

That looks like something we can sink our Pythonic teeth into, since we know how to get the codon's count (or None) to see if we've yet seen it or not:

In [48]:
def codonCount( strORF ):
    
    hashCounts = {}
    for i in xrange( 0, len( strORF ), 3 ):
        strCodon = strORF[i:( i + 3 )]
        iCount = hashCounts.get( strCodon )
        if iCount == None:
            hashCounts[strCodon] = 1
        else:
            hashCounts[strCodon] = iCount + 1
    
    return hashCounts

codonCount( "ATGGATTTAATG" )
Out[48]:
{'ATG': 2, 'GAT': 1, 'TTA': 1}

Bingo! Suddenly it looks like we have a working Python function. But wait, there's more! get takes an optional second argument, the default value to be returned if the requested key is not present. The default value defaults to None, but we can request that it be something different:

In [49]:
{"example" : 1, "hash" : 2}.get( "doesn't barf!", "I'm not here" )
Out[49]:
"I'm not here"

Like, say, the value 0:

In [50]:
def codonCount( strORF ):
    
    hashCounts = {}
    for i in xrange( 0, len( strORF ), 3 ):
        strCodon = strORF[i:( i + 3 )]
        iCount = 1 + hashCounts.get( strCodon, 0 )
        hashCounts[strCodon] = iCount
    
    return hashCounts

codonCount( "ATGGATTTAATG" )
Out[50]:
{'ATG': 2, 'GAT': 1, 'TTA': 1}

And remove the middleman again gets us down to:

In [51]:
def codonCount( strORF ):
    
    hashCounts = {}
    for i in xrange( 0, len( strORF ), 3 ):
        strCodon = strORF[i:( i + 3 )]
        hashCounts[strCodon] = 1 + hashCounts.get( strCodon, 0 )
    
    return hashCounts

codonCount( "ATGGATTTAATG" )
Out[51]:
{'ATG': 2, 'GAT': 1, 'TTA': 1}

It never hurts to do more tests:

In [52]:
codonCount( "ATGGATTTAATGATGGATTTAATGATGGATTTAATGATGGATTTAATGATGGATTTAATG" )
Out[52]:
{'ATG': 10, 'GAT': 5, 'TTA': 5}
In [53]:
strORF = randomNucleotides( 100 )
strORF
Out[53]:
'AAATCGTGGAAGACGAATACGGTGTTGTACGTGTAGCAAATAAGAGTGACATGTACTGGAGTCTAGCTTGATCTACTATATTGCCCGTTGCGATGTGCAT'
In [54]:
print( codonCount( strORF ) )
{'CTT': 1, 'AAG': 1, 'AAA': 1, 'ATA': 1, 'ACA': 1, 'AGA': 1, 'AAT': 1, 'CTA': 2, 'ACT': 1, 'ACG': 2, 'CAA': 1, 'CCG': 1, 'TAT': 1, 'TGT': 2, 'CGA': 1, 'GAT': 1, 'TGC': 1, 'TAG': 2, 'GGA': 1, 'TGG': 1, 'TAC': 1, 'TCG': 1, 'TTG': 2, 'GCA': 1, 'GTC': 1, 'GTG': 3, 'T': 1}
In [55]:
print( codonCount( randomNucleotides( 1000 ) ) )
{'CTT': 8, 'ATG': 6, 'ACA': 8, 'AAA': 5, 'ATC': 3, 'AAC': 10, 'ATA': 6, 'AGG': 6, 'CCT': 8, 'ACT': 8, 'AGC': 7, 'AAG': 7, 'AGA': 7, 'CAT': 6, 'AAT': 1, 'ATT': 4, 'CTG': 6, 'CTA': 8, 'CTC': 10, 'CAC': 3, 'ACG': 4, 'CCG': 4, 'AGT': 3, 'CAG': 8, 'CAA': 3, 'CCC': 4, 'TAT': 4, 'GGT': 5, 'GCG': 3, 'TGT': 6, 'CGA': 2, 'CCA': 3, 'CGC': 3, 'GAT': 4, 'CGG': 3, 'TTT': 3, 'TGC': 4, 'GGG': 5, 'TAG': 7, 'GGA': 2, 'TGG': 4, 'GGC': 7, 'TAC': 4, 'TTC': 5, 'TCG': 6, 'TTA': 6, 'GAC': 5, 'CGT': 7, 'GAA': 7, 'TCA': 7, 'GCA': 4, 'GTA': 8, 'GCC': 3, 'GTC': 6, 'TAA': 3, 'GTG': 8, 'GAG': 2, 'GTT': 7, 'GCT': 5, 'ACC': 3, 'TGA': 6, 'C': 1, 'TTG': 4, 'TCC': 4, 'TCT': 5}

But I think you've got that one licked!

Bioinformatics function exercises

1. We never did get around to defining a factorial function in class. Recall that the factorial operator, written $n!$, operates on an integer, such that:

$$n! = \prod_{i=1}^{n} i$$

Thus $1! = 1$, $2! = 2$, $3! = 6$, $4! = 24$, $5! = 120$, and $0! = 1$ by definition. Complete the following function definition so that the examples below all return True:

In [56]:
def factorial( iN ):
    
    iNFactorial = 1
    return iNFactorial
In [57]:
factorial( 0 ) == 1
Out[57]:
True
In [58]:
factorial( 1 ) == 1
Out[58]:
True
In [59]:
factorial( 6 ) == 720
Out[59]:
False
In [60]:
factorial( 20 ) == 2432902008176640000
Out[60]:
False

2. Lifted from one of my favorite bioinformatics tutorials, Project Rosalind, complete the following function definition that inputs two positive integers and returns the sum of all odd integers between them (inclusive). Note that it's useful to figure out which of the two inputs is larger, and that the modulo operator will tell you if an integer is odd or even:

In [61]:
# Even
14 % 2
Out[61]:
0
In [62]:
# Odd
13 % 2
Out[62]:
1
In [63]:
def anOddSum( iOne, iTwo ):
    
    iSum = 0
    return iSum
In [64]:
def anOddSum( iOne, iTwo ):
    
    if iOne > iTwo:
        iOne, iTwo = iTwo, iOne
    if not ( iOne % 2 ):
        iOne += 1
    iSum = 0
    for i in xrange( iOne, iTwo + 1, 2 ):
        iSum += i
    return iSum
In [65]:
anOddSum( 1, 3 ) == 4
Out[65]:
True
In [66]:
anOddSum( 3, 1 ) == 4
Out[66]:
True
In [67]:
# 3 + 5 + 7
anOddSum( 2, 8 ) == 15
Out[67]:
True
In [68]:
anOddSum( 100, 200 ) == 7500
Out[68]:
True

Data input and output

The ability to read and write information in bulk is one of the most useful tools you'll have for manipulating your own biological data. While standard file formats like FASTAs and BAMs are critical, if you do serious biological research you'll inevitably need to ask a question of your data that nobody's ever asked before. Hence calling it research! Given a set of gene sequences, which ones have one of three degenerate RNA binding protein sites in their first intron or second exon? Given a hundred gene expression conditions, in which are at least one gene from your favorite pathway upregulated and at least one of its genes downregulated? These aren't particularly unusual biological questions, but they'd be very difficult to address with a pre-built system - and very easy to address using a flexible environment like Python. Imagine trying to do science in a foreign language using only a pre-defined phrasebook - it's not impossible, but it's no fun, either. Learning the whole language lets you pursue your own questions on your own terms.

Standard input and output streams

The easiest ways to get data into and out of Python are the standard input (or stdin) and standard output (stdout) streams. Recall that an input stream acts like a special, read-only collection of strings that you can't move backward through. Likewise an output stream is a special target object that knows how to write strings to the screen or a file. Standard input and output are default streams that exist for every program, on every computer, that are by default attached to the command line and the screen, respectively.

Since the IPython Notebook doesn't have a command line, it's rather difficult to demonstrate standard input here. However, you've already seen standard output over and over. Every time you use the print function, it displays its argument to standard output. IPython Notebook captures that and displays it in your web browser, just like it would normally go to the screen. Hence we can write:

In [69]:
print( "This will show up on standard output" )
This will show up on standard output

This is equivalent to writing:

In [70]:
sys.stdout.write( "This will show up on standard output\n" )
This will show up on standard output

Note the one small but important difference: print automatically appends a newline to the end of its argument, writeing to an output stream does not. You can see this in the following examples:

In [71]:
print( "I am the very model" )
print( "of a modern Major General" )
I am the very model
of a modern Major General
In [72]:
sys.stdout.write( "I am the very model" )
sys.stdout.write( "of a modern Major General" )
I am the very modelof a modern Major General

This can be useful if you want to accumulate a variety of output one piece at a time that's shown all on one line, or it can be irritating if you want to write lines to an output stream (standard out or otherwise) and keep forgetting to append newlines. If you're writeing string variables rather than literal strings, you can also do this using concatenation:

In [73]:
def someFunctionThatWritesOutput( ostm, astrOutput ):
    
    for strOutput in astrOutput:
        ostm.write( strOutput + "\n" )

someFunctionThatWritesOutput( sys.stdout, ["animal", "vegetable", "mineral"] )
animal
vegetable
mineral

This is a very typical use of an output stream such as standard output: an output stream is a targetable data type on which the write method is used to produce output. There's only rarely any reason to use sys.stdout rather than print if you just need to show data on the screen (or in a standard output file - more on that later). But sys.stdin can be very useful to read input from the user or a file - I just can't show you here! Typical usage of sys.stdin looks like:

In [74]:
def someFunctionThatReadsInput( istm ):
    
    astrInput = []
    for strLine in istm:
        astrInput.append( strLine )
    
    return astrInput

someFunctionThatReadsInput( sys.stdin )
Out[74]:
[]

But notice that nothing happened - IPython Notebook defines sys.stdin to be empty. That's no problem - I can show you more about input streams using one that's opened on a file rather than standard input - but first we have to discuss creating files and non-standard input/output streams in the first place.

Writing files

You might have noticed above that I created a variable named ostm to "hold" the standard output stream that's normally contained in sys.stdout. This is possible because all output streams are simply special, targetable data values, just like all strings are data values targetable by methods like find and lists are data values targetable by methods like reverse. Standard output happens to be a pre-defined output stream that writes to the screen, but we can create new output streams that write to files instead using the open function.

open takes two arguments, a string file name and a mode requesting to read from that file ("r") for an input stream or to write to that file ("w") for an output stream. It returns either an input or an output stream depending on the mode. Since we're going to create files using output streams first, let's try making one using mode "w" first:

In [75]:
ostmTest = open( "test.txt", "w" )

ostmTest.write( "I am the very model\n" )
ostmTest.write( "of a modern Major General\n" )

If you'd like to see what just happened, use a text editor like jEdit to open up the test.txt file that should have been created in the same directory as your IPython Notebook file. It should contain:

I am the very model of a modern Major General

An opened file stays open either until you stop using the variable in Python or you close it. We can continue writing new data to test.txt:

In [76]:
ostmTest.write( "I've information vegetable, animal, and mineral\n" )
ostmTest.write( "I know the kings of England\n" )

Now if you open up the text file again (or refresh it), it should contain:

I am the very model of a modern Major General I've information vegetable, animal, and mineral I know the kings of England

Since we wrote our example functions above to work with any output stream, not just standard output, we can even reuse them with a newly opened file:

In [77]:
someFunctionThatWritesOutput( open( "test.txt", "w" ), ["animal", "vegetable", "mineral"] )

Now if you look at your file... hey, wait, what happened? It just contains:

animal vegetable mineral

Where did all of your (other) Gilbert and Sullivan lyrics go? Be warned that when you open a file to write to it in mode "w", if it already exists, its contents are completely replaced. Let's say that again for emphasis:

When you open a file, it will be created if it doesn't already exist. But if it does already exist, any previous contents are completely deleted.

This is a great way to lose data! Never, ever open a file in mode "w" that you don't want to lose - for example, a file named python03.py, or my_important_experimental_data.txt. It's just fine to open such files for reading (we'll discuss how to do that next), or you can open then using the append mode "a" to add new data without deleting what's already there:

In [78]:
someFunctionThatWritesOutput( open( "test.txt", "a" ), ["Marathon", "Waterloo"] )

Now your file should contain something more like you expected earlier:

animal vegetable mineral Marathon Waterloo

You can use a variety of structured data formats to create files that store information that's easy for Python to read and write. We'll discuss this a lot more later on, but for now, an easy way to create some example data is using the string split and join methods. split targets a string, accepts one string argument, and returns a list of strings tokenizing the targeted string by separating substrings between each occurrence of the argument:

In [79]:
"this is a test".split( " " )
Out[79]:
['this', 'is', 'a', 'test']
In [80]:
"the, argument, can, be, more, than, one, character, long".split( ", " )
Out[80]:
['the', 'argument', 'can', 'be', 'more', 'than', 'one', 'character', 'long']
In [81]:
"aaabbbcccbbbdddbbbeeebbbfffbbb".split( "bbb" )
Out[81]:
['aaa', 'ccc', 'ddd', 'eee', 'fff', '']
In [82]:
":Note the empty:first and last strings:".split( ":" )
Out[82]:
['', 'Note the empty', 'first and last strings', '']

The join method also targets a string, accepts one argument that's a list of strings, and reverses a split by returning one string combining the given list of tokens separated by the targeted string:

In [83]:
" ".join( ["this", "is", "a", "test"] )
Out[83]:
'this is a test'
In [84]:
", ".join( ["the", "target", "can", "be", "more", "than", "one", "character", "long"] )
Out[84]:
'the, target, can, be, more, than, one, character, long'
In [85]:
"bbb".join( ["aaa", "ccc", "ddd", "eee", "fff", ""] )
Out[85]:
'aaabbbcccbbbdddbbbeeebbbfffbbb'
In [86]:
":".join( ":Note the empty:first and last strings:".split( ":" ) )
Out[86]:
':Note the empty:first and last strings:'

Let's use join in particular to write an example file that we'll use to demonstrate input streams in the next section. It will contain a series of transcript sequences, one per line, with the exons separated by commas:

In [87]:
iNumberOfTranscripts = 1000

ostmExample = open( "transcripts.txt", "w" )
for i in xrange( iNumberOfTranscripts ):
# Each gene has between 1 and 10 exons
    iNumberOfExons = random.randrange( 1, 11 )
    astrExons = []
    for j in xrange( iNumberOfExons ):
# Each exon is between 100 and 1000 nucleotides long
        iNumberOfBases = random.randrange( 100, 1001 )
        astrExons.append( randomNucleotides( iNumberOfBases ) )
    ostmExample.write( ",".join( astrExons ) + "\n" )

Take a look at transcripts.txt, which we just created in the same folder as your IPython Notebook. It should contain a whole bunch of nucleotides, with a few commas thrown in there for good measure. It's no fun to look at by eye - but it's no problem to read back into Python, either!

Reading files

An input stream is conceptually the reverse of an output stream, unsurprisingly. It's a data object that acts like a collection of strings. If you iterate over it using, for example, a for loop, each line of text in the stream is generated one after the other. Unlike a list collection, though, you can't go backwards, only forwards! Again, think of a cursor reading one line at a time through the stream. Although we can't easily see sys.stdin provide this behavior for us in IPython Notebook, we can see it by opening a file with read mode "r", for example the test.txt file created above:

In [88]:
istmTest = open( "test.txt", "r" )

for strLine in istmTest:
    print( strLine )
animal

vegetable

mineral

Marathon

Waterloo

The default mode of the open function is "r", so we can omit it if we want to:

In [89]:
istmTest = open( "test.txt" )

for strLine in istmTest:
    print( strLine )
animal

vegetable

mineral

Marathon

Waterloo

And like we have before, we can cut out the middleman (temporary variable) in many cases to iterate over the input stream directly:

In [90]:
for strLine in open( "test.txt" ):
    print( strLine )
animal

vegetable

mineral

Marathon

Waterloo

Or we can pass it as an input data object to any function that expects an input stream, like the one we wrote above:

In [91]:
someFunctionThatReadsInput( open( "test.txt" ) )
Out[91]:
['animal\n', 'vegetable\n', 'mineral\n', 'Marathon\n', 'Waterloo\n']

Hey, that's funny - why do all of our lines of input text have a newline at the end? We couldn't see it explicitly before when we printed them, but they did look unusually spaced out. Input streams do not typically remove the line delimiter at the end of each line of text they read in, so if we don't like it, we have to remove it manually. Under some circumstances, it might not be a problem, say if we just wanted to print some lines of data back out again:

In [92]:
for strLine in open( "test.txt" ):
    if strLine[1] == "a":
        sys.stdout.write( strLine )
Marathon
Waterloo

But most of the time it's a pain, and we can use a string function like rstrip to remove the trailing newline:

In [93]:
for strLine in open( "test.txt" ):
    print( strLine.rstrip( ) )
animal
vegetable
mineral
Marathon
Waterloo

Note that rstrip removes (strips) all whitespace at the end (right-hand side, hence "r") of a string, not just newlines, so you do have to be a bit careful with it sometimes if you expect your data to contain spaces or tabs:

In [94]:
"no whitespace, nothing happens".rstrip( )
Out[94]:
'no whitespace, nothing happens'
In [95]:
"one newline gets nuked\n".rstrip( )
Out[95]:
'one newline gets nuked'
In [96]:
"many newlines, all gone\n\n\n\n".rstrip( )
Out[96]:
'many newlines, all gone'
In [97]:
"but so is everything else: \t \n  ".rstrip( )
Out[97]:
'but so is everything else:'

Input streams become most useful for biological data handling when they contain interestingly structured data, though, such as the nucleotide sequences we generated above. Let's revisit that list of transcript sequences and print out 1) which genes contained the motif GATACAACG and 2) which exon index they were in:

In [98]:
iGene = 0
for strLine in open( "transcripts.txt" ):
    iGene += 1
    astrExons = strLine.rstrip( ).split( "," )
    for iExon in xrange( len( astrExons ) ):
        if astrExons[iExon].find( "GATACAACG" ) >= 0:
            print( "Found: gene " + str(iGene) + ", exon " + str(iExon + 1) )
Found: gene 26, exon 5
Found: gene 33, exon 4
Found: gene 52, exon 1
Found: gene 150, exon 4
Found: gene 175, exon 1
Found: gene 196, exon 4
Found: gene 259, exon 1
Found: gene 325, exon 6
Found: gene 484, exon 4
Found: gene 663, exon 4
Found: gene 821, exon 2
Found: gene 836, exon 1
Found: gene 959, exon 9

We've made a few design decisions here - gene and exon indices are 1-based rather than 0-based, for example - but otherwise each step in this Python program should make sense:

counting up lines of input (genes) starting before the first...
for each line in "transcripts.txt":
    increment which gene (line of input) we're working on
    remove the trailing newline, and tokenize the transcript into comma-delimited exons
    for each index in the list of exons:
        if the exon string value at that index contains "GATACAACG":
            print which gene and exon index we found it in

If they don't make sense, try changing the program above and seeing what happens, or use the blank cell below to create some new code that reads the file differently (remember, you can always create new cells to experiment using the "Insert" menu above, too). Then go on to try the exercises below - probably more useful than comic opera lyrics!

In [98]:
 

I/O exercises

1. Complete the following function that accepts four arguments:

  1. An output stream.

  2. A column width.

  3. A list of strings, each representing a sequence identifier.

  4. A list of strings, each representing a nucleotide or amino acid sequence.

and writes the sequence identifiers and sequences to the output stream as a correctly formatted multi-FASTA file, with sequences wrapped at the requested width. That is, given the inputs:

In [99]:
iWidth = 4
astrIDs = ["one", "two", "three"]
astrSeqs = ["AACGTAATGCAATAA", "GCTTTGGC", "GCAGAATTAAT"]

the function should produce the output:

>one AACG TAAT GCAA TAA >two GCTT TGGC >three GCAG AATT AAT

Similarly, if we change the column width to 6, it should produce:

>one AACGTA ATGCAA TAA >two GCTTTG GC >three GCAGAA TTAAT
In [100]:
def writeFasta( ostm, iWidth, astrIDs, astrSeqs ):
    
    pass

2. Call the function you just wrote so that it produces a file named sequences.fasta in your current directory, containing 1000 nucleotide sequences named gene1 through gene1000 with lengths randomly chosen between 800 and 1200 nucleotides. Use the randomNucleotides function we wrote above.

In [101]:
writeFasta( sys.stdout, 60, [], [] )

3. Complete the following function that accepts one argument, an input stream, and reads a FASTA file from it to return a list containing two lists of strings, the first sequence IDs and the second sequences. That is, given an input stream containing either of the examples above (regardless of column width), it should return:

In [102]:
[["one", "two", "three"], ["AACGTAATGCAATAA", "GCTTTGGC", "GCAGAATTAAT"]]
Out[102]:
[['one', 'two', 'three'], ['AACGTAATGCAATAA', 'GCTTTGGC', 'GCAGAATTAAT']]
In [103]:
def readFasta( istm ):
    
    return []

4. Complete the following function that accepts three arguments:

  1. A nucleotide motif to find.

  2. A list of gene IDs.

  3. A list of gene sequences.

and returns a list of the gene IDs whose corresponding sequences contain the requested motif. For example, using the example sequences above, the input:

In [104]:
strMotif = "TAA"

should produce the output:

In [105]:
["one", "three"]
Out[105]:
['one', 'three']
In [105]:
 

5. Use readFasta to read in the sequences.fasta file you created above, and pass the resulting data to findMotif to list which of your randomly generated gene sequences contain the motif GATACAAC.

In [105]:
 

6. Why did I choose a motif of that particular length?

In [106]:
strAnswer = ""

7. Why is randomNucleotides a crummy way to generate biologically realistic sequences?

In [107]:
strAnswer = ""
In [107]: