Lecture 2 - 18.3.2015¶

Last update: 17.3.2015¶

Tel-Aviv University / 0411-3122 / Spring 2015¶

This notebook is still a draft.

In [ ]:

from IPython.display import YouTubeVideo, HTML, Image

Previously on Py4Life¶

Python
The IPython notebook
Variables (int, float, bool)
Operators (+, -, *, ..., ==, <, ..., and, or, ...)
Conditional statements (if, elif, else)
While loops

In today's episode¶

Strings
Lists
Loops (for)

Strings¶

Strings are ordered collections of characters.

Ordered¶

An ordered collection is one whose elements are numbered with indexes: 0, 1, 2, 3, 4...
Note that the first index is 0, not 1!

In [ ]:

YouTubeVideo('kQC82okzTXI')

Characters¶

Characters are textual symbols, like letters (ABCDE...), numerals (12345), punctuation marks (,.?:&), and even things like newline (\n) and whitespace ( ).

(Image: keyboard)

Back to strings¶

Most commonly, strings are used to work with text.

We can assign and print strings:

In [ ]:

x = "Py4Life"
y = 'I love python'
print(x)
print(y)

Strings are objects of type str:

In [ ]:

type(x)

We can concatenate strings:

In [ ]:

print(x + "2015")

We can convert strings to numbers and vice versa (as long as the string actually holds a valid number):

In [ ]:

x = "4"
y = int(x)
print("y+1 =", y + 1)

Otherwise, we get an error message...

In [ ]:

print("x+1 =", x + 1)

In [ ]:

x = str(y)
print("x =", x)

In [ ]:

x = "3.14"
y = float(x)
print("y*2 =", y * 2)
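Whether a conversion is appropriate can be checked before attempting it. The following is a minimal sketch, assuming we only care about non-negative whole numbers; the variable names are made up for this illustration, and str.isdigit() is only a rough check (it rejects signs and decimal points):

```python
# two candidate strings: one holds a whole number, the other does not
good = "42"
bad = "4.2"

# isdigit() is True only when every character in the string is a digit
print(good.isdigit())
print(bad.isdigit())

# convert only when the check passes
if good.isdigit():
    number = int(good)
    print("number + 1 =", number + 1)
```

Calling int(bad) directly would stop the program with a ValueError, which is the kind of error message mentioned above.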

Why do we care about text in programming?¶

Sequences
Data in formatted text files (lesson 4)
Free text

Because we are biologists, strings are not just text, they are also sequences!

In [ ]:

dna = "ATGCGTA"
print(dna)

Again, we can concatenate strings:

In [ ]:

upstream = "AAA"
downstream = "GGG"
dna = upstream + "ATG" + downstream
print(dna)

We can find the length of a string using the function len:

In [ ]:

n = len(dna)
print("The length of the DNA variable is", n)

dna = dna + "AGCTGA"
print("Now it is", len(dna))

Just a moment, what was that...?

In [ ]:

print(dna)
dna = dna + "AGCTGA"
print(dna)

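The statement dna = dna + "AGCTGA" builds a new string from the old value and assigns it back to the same variable. The same pattern also works with numbers: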

In [ ]:

x = 10
x = x + 7
print(x)

String slicing¶

We can extract subsets of a string by using slicing, with the corresponding indexes.
Remember: string indexes start from 0!

We can access specific indexes of the string (starting from 0):

In [ ]:

bacteria = 'Escherichia coli'

In [ ]:

# get the 1st and 6th letters
print(bacteria[0])
print(bacteria[5])

Indexes work from the tail as well, using negative numbers:

In [ ]:

# get the last letter
print(bacteria[-1])
# get 5th letter from the end
print(bacteria[-5])

We can get a range of indexes using [start:end]

In [ ]:

# get the 3rd to 8th letters
print(bacteria[2:8])

Notice that the start position is included, but not the end position. We actually take the characters with indexes 2,3,4,5,6,7. And what do we get?

In [ ]:

type(bacteria[2:8])
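Because the end position is excluded, the length of a slice is simply end minus start, and two slices that share a boundary fit together without overlap. A small sketch of both properties (the names start, end and piece are chosen here only for illustration):

```python
bacteria = 'Escherichia coli'
start = 2
end = 8

# the slice holds the characters at indexes 2,3,4,5,6,7
piece = bacteria[start:end]
print(piece)

# its length equals end - start
print(len(piece) == end - start)

# slices sharing a boundary rebuild the original string
print(bacteria[:end] + bacteria[end:] == bacteria)
```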

There are shortcuts for taking the first and last characters:

In [ ]:

# get the first 5 letters
print(bacteria[0:5])
# or simply:
print(bacteria[:5])

# get the 4th letter to the last:
print(bacteria[3:])

# get the last 3 letters
print(bacteria[-3:])

Class exercise 2A¶

The sequence below (named seq) consists of 20 nucleotides.

Print the 2nd and 7th nucleotides.
Print the 2nd nucleotide from the end.
Slice the first half of the sequence.
Slice the second half of the sequence.
Slice the middle 10 nucleotides

In [ ]:

seq = "CAAGTAATGGCAGCCATTAA"
# print 2nd nucleotide
print(seq[1])
# print 7th nucleotide
print(seq[6])
# print 2nd nucleotide from the tail
print(seq[-2])

first_half = seq[:10]
print(first_half)
second_half = seq[10:]
print(second_half)
middle = seq[5:15]
print(middle)

String methods¶

There are some methods (actions, commands) we can apply to strings. These are invoked using the '.' character.

We can change a string to lowercase:

In [ ]:

dna = dna.lower()
print(dna)

And back to uppercase:

In [ ]:

dna = dna.upper()
print(dna)

We can replace characters:

In [ ]:

rna = dna.replace("T", "U")
print(rna)
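Note that string methods return a new string and leave the original untouched, which is why the results above are assigned to a variable. A short sketch with a made-up sequence (gene and transcript are illustrative names, not part of the lecture code):

```python
gene = "ATGCGTA"

# replace() returns a new string; gene itself is not modified
transcript = gene.replace("T", "U")

print(gene)
print(transcript)

# calling a method without assigning the result has no lasting effect
gene.lower()
print(gene)
```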

Count¶

We can count characters.

For example, let's count the number of histidines (H) and prolines (P) in the AA (amino-acid) sequence of Human Insulin:

In [ ]:

insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'
print("# of histidine:", insulin.count('H'))
print("# of proline:", insulin.count('P'))

Find¶

We can find a substring within a string. For example, we can look for the character D in the insulin sequence.

In [ ]:

pos = insulin.index('D')
print(pos)

In [ ]:

type(pos)

In [ ]:

print(insulin[pos])

The result is the index (position) of the first D found in the sequence.

We can also look for longer substrings, representing motifs. For example, let's find the position of the Insulin B-chain in the entire peptide:

In [ ]:

b_chain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
position = insulin.index(b_chain)
print("Position:", position)

In [ ]:

print(len(b_chain))

In [ ]:

found = insulin[position:position + len(b_chain)] # slicing (notice the ':')
print(b_chain == found)
print("Original:", b_chain)
print("Found:   ", found)
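What happens when the substring is not there? The index method stops with a ValueError, while the related string method find returns -1 instead. A minimal sketch on a shortened, illustrative peptide (the names peptide and motif are assumptions made for this example):

```python
# the first residues of the insulin sequence, shortened for the example
peptide = "MALWMRLLPLLALLALWGPDPAAAFVNQ"

# find() and index() agree when the substring exists
print(peptide.find("D"), peptide.index("D"))

# find() returns -1 when the substring is missing
print(peptide.find("Z"))

# checking with 'in' first avoids the ValueError raised by index()
motif = "ZZZ"
if motif in peptide:
    print("Position:", peptide.index(motif))
else:
    print(motif, "was not found")
```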

Split¶

We can split a string on every occurrence of a separator character:

In [ ]:

names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(species)

What do we get?

In [ ]:

type(species)
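The opposite operation is the string method join, which glues the elements of a list of strings back together with a separator in between. This is a small sketch of the round trip; join is not covered in the lecture itself:

```python
names = "melanogaster,simulans,yakuba,ananassae"

# split the string into a list of 4 strings
species = names.split(",")
print(len(species))

# join the list back, using the same separator
rejoined = ",".join(species)
print(rejoined == names)

# or using a different separator
print(" | ".join(species))
```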

Lists¶

Lists are similar to strings in being sequential, only they can contain any type of data, not just characters.

This includes int, float, bool, str, and even list.
Lists could even include mixed variable types.

We define a list just like any other variable, but wrap the elements in '[ ]' and use ',' to separate them.

In [ ]:

# a list of strings
apes = ["Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"]
print(apes)

In [ ]:

# a list of numbers
nums = [7,13,2,400]
print(nums)

In [ ]:

# a mixed list
mixed = [12,'Mus musculus',True]
print(mixed)

You can access list elements just like strings, using indexes (starting from 0):

In [ ]:

print("Human:", apes[0])
print("Orangutan:", apes[-1])

Lists are dynamic: you can change, append, insert and remove elements. Appending, inserting and removing are done using list methods, again using the '.'.

First, we can access and change list elements by index:

In [ ]:

new_apes = apes[:] # make a copy of the apes list
new_apes[2] = 'Hylobates lar'
print(new_apes)
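The [:] slice matters here. A plain assignment does not copy a list; it only gives the same list a second name, so a change made through either name is visible through both. A sketch of the difference, using illustrative names (primates, alias, clone) that are not part of the lecture code:

```python
primates = ["Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"]

alias = primates      # a second name for the SAME list
clone = primates[:]   # a new list with the same elements

alias[0] = "changed through alias"
clone[1] = "changed in clone only"

# the change made through alias shows up here; the change to clone does not
print(primates)
print(clone)
```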

This does NOT work with strings though...

In [ ]:

print(dna)
dna[5] = 'G'

In [ ]:

# add element to the end of the list
apes.append("Gorilla gorilla")
print(apes)

In [ ]:

# insert element at a given index
apes.insert(2, "Pan paniscus")
print(apes)

In [ ]:

# remove element from list
apes.remove("Pongo pygmaeus")
print(apes)

To remove a list item by index:

In [ ]:

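# work on a copy, so that apes itself keeps all of its elements
fewer_apes = apes[:]
# option 1: remove the value stored at index 1
fewer_apes.remove(fewer_apes[1])
# option 2: delete the element at index 1 directly
del fewer_apes[1]
print(fewer_apes)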

We can concatenate lists, just like strings:

In [ ]:

print(apes + ["Pongo pygmaeus", "Pongo abelii"])

Searching in lists is done using index (not find):

In [ ]:

i = apes.index('Pan troglodytes')
print(i)
print(apes[i])

You can also check if something is in a list (works as well for strings):

In [ ]:

if 'Saguinus nigricollis' in apes:
    print('Saguinus nigricollis is an ape')
else:
    print('Saguinus nigricollis is not an ape')
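With strings, the in operator checks for a whole substring, not just a single character. A short sketch on a made-up sequence (dna_piece is an illustrative name):

```python
dna_piece = "AAAATGGGG"

# 'in' answers yes/no; index() tells us where
if "ATG" in dna_piece:
    print("Found a start codon at index", dna_piece.index("ATG"))
else:
    print("No start codon")

# 'not in' is the opposite check
print("TTT" not in dna_piece)
```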

Lists of numbers¶

Suppose we have a list of experimental measurements and we want to do basic statistics: count the number of results, calculate the average, and find the maximum and minimum.

In [ ]:

measurements = [33, 55,45,87,88,95,34,76,87,56,45,98,87,89,45,67,45,67,76,73,33,87,12,100,77,89,92]

count = len(measurements)
avg = sum(measurements) / len(measurements)
maximum = max(measurements)
minimum = min(measurements)

print(count, "measurements with average", avg, "maximum", maximum, "minimum", minimum)

Sorting lists¶

We can sort lists using the sorted function.
If the list is made entirely of numbers, then sorting is straightforward:

In [ ]:

sorted_measurements = sorted(measurements)
print(sorted_measurements)

A list of strings will be sorted lexicographically (think about the way '<' and '>' work on strings):

In [ ]:

sorted_apes = sorted(apes)
print(sorted_apes)
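sorted returns a new list and leaves the original in its original order. Lists also have a sort method that reorders the list itself, and both accept reverse=True for descending order. A small sketch with illustrative values (neither the sort method nor reverse is covered in the lecture):

```python
values = [33, 55, 45, 12]

# sorted() builds a new list; values keeps its order
ordered = sorted(values)
print(values)
print(ordered)

# descending order
descending = sorted(values, reverse=True)
print(descending)

# the sort() method changes the list in place and returns None
values.sort()
print(values)
```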

But beware of mixed lists!

In [ ]:

mixed = apes + measurements
print(mixed)
print(sorted(mixed))

List of lists (nested lists)¶

List elements can be of any type, including lists!
For example:

In [ ]:

birds = ['Gallus gallus', 'Corvus corone', 'Passer domesticus']
snakes = ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']
animals = [apes,birds,snakes]
print(animals)

We access lists of lists using double-indexes. For example, to get the 3rd snake:

In [ ]:

print(animals[2][2])

Note that the elements of the outer list are lists themselves, not strings. For example:

In [ ]:

type(animals[1])

List slicing¶

We can slice lists just like we did with strings, to get partial lists.
For example:

In [ ]:

# get the first 10 measurements
print(measurements[:10])
# get the last 3 measurements
print(measurements[-3:])

Class exercise 2B¶

Use the lists birds and snakes defined above to create a single list of strings with the animal names. Then add the string Mus musculus to the list. Finally, remove the Corvus corone from the list. Print the 2nd to 5th elements of the resulting list, sorted alphabetically.

In [ ]:

# create list
animals = birds + snakes
# add Mus musculus
animals.append('Mus musculus')
# remove Corvus corone element
animals.remove('Corvus corone')
# print
print(sorted(animals[1:5]))

Loops¶

Say we want to print each element of our list:

In [ ]:

print(apes[0], "is an ape")
print(apes[1], "is an ape")
print(apes[2], "is an ape")
print(apes[3], "is an ape")

But this is very repetitive and relies on us knowing the number of elements in the list. What we need is a way to say something along the lines of “for each element in the list of apes, print out the element, followed by the words ‘is an ape’”. Python’s loop syntax allows us to express those instructions like this:

In [ ]:

for ape in apes:
    print(ape, "is an ape")

(Image: Python loop)

A more complex loop will go over each ape name and print some stats:

In [ ]:

for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape, "is an ape. Its name starts with", first_letter)
    print("Its name has", name_length, "letters")

We can also loop over a string.

Let's go over the Insulin AA sequence and count the number of prolines manually:

In [ ]:

count = 0
for aa in insulin:
    if aa == "P":
        count = count + 1
print("# of prolines:", count)

Can you remember another way of doing this?

Let's count how many measurements (see above) are above the average:

In [ ]:

print(measurements)
print(avg)

In [ ]:

over = 0
for x in measurements:
    if x > avg:
        over = over + 1
print(over, "measurements are over the average.")

Class exercise 2C¶

Complete the code below to calculate the fraction of electrically-charged amino acids in the Insulin sequence.

In [ ]:

charged = ['R','H','K','D','E']

charged_count = 0
for aa in insulin:
    if aa in charged:
        charged_count += 1

insulin_length = len(insulin)
charged_ratio = charged_count/insulin_length
print("Ratio of charged amino acids is:",charged_ratio)

Using `range`¶

Sometimes we want to loop over consecutive numbers.

This is accomplished using the range function.

range accepts one, two, or three arguments: the lower and upper limits and the step size.
The lower limit can be omitted (the default is 0), and so can the step (the default is 1).
The upper limit is not included.

In [ ]:

for i in range(10): # aka range(0,10,1)
    print(i)

In [ ]:

for i in range(10,20):
    print(i, end=' ') # prints ending with space instead of newline

In [ ]:

for i in range(100,1000,10):
    print(i, end=' ')
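To see at a glance which numbers a range produces, it can be converted to a list. The sketch below also shows a negative step, which counts down (the variable names are only for illustration):

```python
# one argument: the upper limit (not included)
first_five = list(range(5))
print(first_five)

# two arguments: lower limit (included) and upper limit (not included)
two_to_five = list(range(2, 6))
print(two_to_five)

# three arguments: lower limit, upper limit, step
every_third = list(range(0, 10, 3))
print(every_third)

# a negative step counts down; the upper limit is still excluded
countdown = list(range(5, 0, -1))
print(countdown)
```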

Let's check whether the number n is prime, that is, divisible only by 1 and itself:

In [ ]:

n = 97 # try other numbers
divider = 1

for k in range(2,n): # why start at 2? can we choose a different limit to range? a different step perhaps?
    if n % k == 0:
        divider = k
if divider != 1:
    print(n, "is divided by", divider)
else:
    print(n, "is a prime number")
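One possible answer to the questions in the comment above: any divisor larger than the square root of n is paired with one smaller than it, so it is enough to test up to the square root, and we can stop at the first divisor found. This is only a sketch, assuming n is at least 2 and of moderate size; it uses break, which is not covered in this lecture, and it reports the smallest divisor rather than the largest:

```python
results = []
for n in [91, 97]:
    divider = 1
    # test candidates from 2 up to and including the square root of n
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            divider = k
            break  # leave the inner loop at the first divisor found
    results.append(divider)
    if divider != 1:
        print(n, "is divisible by", divider)
    else:
        print(n, "is a prime number")
```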

We can also use range() to loop over the indexes of a list. This is useful when we need the position of each element and not just its value.

In [ ]:

for i in range(len(apes)):
    print(apes[i])
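When both the index and the element are needed, the built-in enumerate function gives them together, as an alternative to range(len(...)). A brief sketch with an illustrative list (enumerate is not covered in the lecture):

```python
primates = ["Homo sapiens", "Pan troglodytes", "Pan paniscus"]

# enumerate yields (index, element) pairs
pairs = []
for i, name in enumerate(primates):
    print(i, name)
    pairs.append((i, name))
```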

Class exercise 2D¶

1) Restriction fragment lengths¶

Here’s a short DNA sequence:

ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT

The sequence contains a recognition site for the EcoRI restriction enzyme, which cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk). Write a program which will calculate the size of the two fragments that will be produced when the DNA sequence is digested with EcoRI.

(from Python for Biologists)

In [ ]:

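seq = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
fragments = seq.split('GAATTC') # the recognition site itself is dropped by split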
f1_length = len(fragments[0]) + 1 # add 1 for the 'G'
f2_length = len(fragments[1]) + 5 # add 5 for the 'AATTC'
print('Fragment lengths of',f1_length,'and',f2_length,'will be produced.')
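An alternative sketch of the same calculation locates the site with find and then slices the sequence at the cut position, so that the fragments themselves are available and not only their lengths. It assumes the recognition site occurs exactly once; the names dna_seq, site, cut, left and right are made up for this example:

```python
dna_seq = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
site = "GAATTC"

# EcoRI cuts between the G and the AATTC, one position after the site starts
cut = dna_seq.find(site) + 1

left = dna_seq[:cut]
right = dna_seq[cut:]

print(left)
print(right)
print("Fragment lengths:", len(left), "and", len(right))
```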

2) Complementing DNA¶

Write a program that will print the complement of the sequence above.

(from Python for Biologists)

In [ ]:

complement = ''
for base in seq:
    if base == 'A':
        complement = complement + 'T'
    elif base == 'T':
        complement = complement + 'A' 
    elif base == 'G':
        complement = complement + 'C'
    elif base == 'C':
        complement = complement + 'G'    
    else:
        print("Bad base:", base)
print("Complement:", complement)
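The same result can be sketched without a loop, using the string methods maketrans and translate, which swap characters according to a table. Note that chaining replace calls would not work here, since replacing A with T and then T with A turns every A back into A. Unlike the loop above, this version passes unknown characters through unchanged instead of reporting them; the names short_seq and table are illustrative:

```python
short_seq = "ACTGATCG"

# each character of the first string maps to the character at the same position in the second
table = str.maketrans("ATGC", "TACG")

short_complement = short_seq.translate(table)
print("Sequence:  ", short_seq)
print("Complement:", short_complement)
```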

3) Loop over ape pictures¶

Go over the ape_pics list and display the pics using the command display(Image(url=<url string>)). Before each pic print the name of that ape from the apes list.

In [ ]:

ape_pics = ['http://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Akha_cropped_hires.JPG/330px-Akha_cropped_hires.JPG', 'http://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Schimpanse_Zoo_Leipzig.jpg/330px-Schimpanse_Zoo_Leipzig.jpg', 'http://upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Bonobo_0155.jpg/330px-Bonobo_0155.jpg', 'http://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg/338px-Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg']

In [ ]:

from IPython.display import YouTubeVideo, HTML, Image, display
for i in range(len(apes)):
    print(apes[i])
    display(Image(url=ape_pics[i]))

Extra resources¶

Python for Biologists: Strings, Lists and loops
Software carpentry: Strings, Lists

Fin¶

This notebook is part of the Python Programming for Life Sciences Graduate Students course given at Tel-Aviv University, Spring 2015.

Part of this notebook was adapted from the Lists and Loops chapter in Martin Jones's Python for Biologists book.

The notebook was written using Python 3.4.1 and IPython 2.1.0 (download from PyZo).

The code is available at https://github.com//Py4Life/TAU2015/blob/master/lecture2.ipynb.

The notebook can be viewed online at http://nbviewer.ipython.org/github//Py4Life/TAU2015/blob/master/lecture2.ipynb.

The notebook is also available as a PDF at https://github.com/Py4Life/TAU2015/blob/master/lecture2.pdf?raw=true.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
