Notebook

Exam - Example - Solution¶

Tel-Aviv University / 0411-3122 / Spring 2015¶

1) Protein hydrophobicity¶

In this question we will calculate the hydrphobicity of a protein based on its amino-acid (aa) sequence.

In chemistry, hydrophobicity is the physical property of a molecule (known as a hydrophobe) that is seemingly repelled from a mass of water.

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt

ges_scale is a dict that contains the hydrphobicity score of every aa. The keys are the letters that represent the aa, the values are the scores. For example, the letter for Leucine is L and its hydrphobicity score is -2.8.

In [3]:

ges_scale = {'F':-3.7,'M':-3.4,'I':-3.1,'L':-2.8,'V':-2.6,
            'C':-2.0,'W':-1.9,'A':-1.6,'T':-1.2,'G':-1.0,
            'S':-0.6,'P': 0.2,'Y': 0.7,'H': 3.0,'Q': 4.1,
            'N': 4.8,'E': 8.2,'K': 8.8,'D': 9.2,'R':12.3}

a) Write a function called hydrphobicity that calculates the hydrphobicity of a sequence. The function calculates the average hydrphobicity around every position in the sequence. The average is calculated over a window - a set number of positions. This method is knows as a sliding window as after each calculation the window slides to the next position.

Input: seq is a string, each character is an aa letter. win_size is the windows size to work with (number of positions on which to average).

Output: list of floats of the hydrphobicity scores, calculated for each position of the window.

In [4]:

def hydrophobicity(seq, win_size=15):
  """Scan a protein sequence for hydrophobic regions using the GES
     hydrophobicity scale.
  """

  score = None
  score_list  = []

  for i in range(len(seq)- win_size+1):
    j = i + win_size
    
    if score is None:
      score = 0
      for k in range(i,j):
        score += ges_scale[seq[k]]

    else:
      score += ges_scale[seq[j - 1]]
      score -= ges_scale[seq[i - 1]]

    score_list.append(score/win_size)

  return score_list

protein_seq = 'IRTNGTHMQPLLKLMKFQKFLLELFTLQKRKPEKGYNLPIISLNQ'
scores = hydrophobicity(protein_seq)
print(scores)

[0.7666666666666665, 1.5599999999999998, 0.49333333333333323, 0.8466666666666665, 1.1133333333333333, 0.9333333333333333, 0.8266666666666665, 0.43999999999999984, 1.2133333333333332, 0.7533333333333331, 0.493333333333333, 0.5999999999999996, 0.5999999999999996, 0.28666666666666624, 1.0599999999999996, 2.1066666666666665, 2.106666666666666, 2.3666666666666663, 2.6399999999999992, 2.6399999999999997, 2.82, 3.0533333333333332, 3.5599999999999996, 2.826666666666666, 3.026666666666666, 3.066666666666666, 2.9399999999999995, 3.086666666666666, 2.626666666666666, 2.3599999999999994, 1.8133333333333328]

b) Next, plot the hydrophobicity scores. Don't forget the axis labels.

In [5]:

plt.plot(scores)
plt.xlabel('Protein position')
plt.ylabel('Hydrophobicity');

2) Translate DNA to protein¶

In this question we will translate RNA sequences to amino-acid (aa) sequences.

genetic_code is a dict in which keys are RNA codons - triplets of base names (A, G, C, T) and the values are short names of aa (Leu for Leucine, etc.).

In [6]:

genetic_code = { 
          'UUU':'Phe', 'UUC':'Phe', 'UCU':'Ser', 'UCC':'Ser',
          'UAU':'Tyr', 'UAC':'Tyr', 'UGU':'Cys', 'UGC':'Cys',
          'UUA':'Leu', 'UCA':'Ser', 'UAA':None,  'UGA':None,
          'UUG':'Leu', 'UCG':'Ser', 'UAG':None,  'UGG':'Trp',
          'CUU':'Leu', 'CUC':'Leu', 'CCU':'Pro', 'CCC':'Pro',
          'CAU':'His', 'CAC':'His', 'CGU':'Arg', 'CGC':'Arg',
          'CUA':'Leu', 'CUG':'Leu', 'CCA':'Pro', 'CCG':'Pro',
          'CAA':'Gln', 'CAG':'Gln', 'CGA':'Arg', 'CGG':'Arg',
          'AUU':'Ile', 'AUC':'Ile', 'ACU':'Thr', 'ACC':'Thr',
          'AAU':'Asn', 'AAC':'Asn', 'AGU':'Ser', 'AGC':'Ser',
          'AUA':'Ile', 'ACA':'Thr', 'AAA':'Lys', 'AGA':'Arg',
          'AUG':'Met', 'ACG':'Thr', 'AAG':'Lys', 'AGG':'Arg',
          'GUU':'Val', 'GUC':'Val', 'GCU':'Ala', 'GCC':'Ala',
          'GAU':'Asp', 'GAC':'Asp', 'GGU':'Gly', 'GGC':'Gly',
          'GUA':'Val', 'GUG':'Val', 'GCA':'Ala', 'GCG':'Ala', 
          'GAA':'Glu', 'GAG':'Glu', 'GGA':'Gly', 'GGG':'Gly'}

a) Write a function called protein_translation that translates an RNA sequence to an aa sequence.

Input: seq is a string of A, G, C and U (all uppercase).

Output: list of strings of short names of aa.

In [7]:

def protein_translation(seq):
    """ This function translates a nucleic acid sequence into a
    protein sequence, until the end or until it comes across
    a stop codon
    """
   
    seq = seq.replace('T','U') # Make sure we have RNA sequence
    proteinSeq = []

    i = 0
    while i+2 < len(seq):
        codon = seq[i:i+3]
        aminoAcid = genetic_code[codon]

        if aminoAcid is None: # Found stop codon
            break

        proteinSeq.append(aminoAcid)
        i += 3

    return proteinSeq

dna_seq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
aa_seq = protein_translation(dna_seq)
print(aa_seq)

['Met', 'Val', 'His', 'Leu', 'Thr', 'Pro', 'Glu', 'Glu', 'Lys', 'Ser', 'Ala', 'Val', 'Thr', 'Ala', 'Leu', 'Trp', 'Gly', 'Lys', 'Val']

3) Maize chloroplast¶

a) Write a function called fetch_gb_by_id that receives a GenBank ID (such as KF241981) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the maize chloroplast genome record.

Assume the default settings, as shown in lecture 6. Ignore any warning messages from NCBI that might be displayed.

In [8]:

from Bio import Entrez
from Bio import SeqIO

In [10]:

def fetch_gb_by_id(rec_id):
    handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=rec_id)
    gb_record = SeqIO.read(handle, "gb")
    handle.close()
    return gb_record

maize_chl = fetch_gb_by_id('KF241981')
print(maize_chl.description)
assert len(maize_chl.features) == 263

Zea mays subsp. mays cultivar B73 chloroplast, complete genome.

b) Write a function called extract_rRNA that receives a SeqRecord object and extract its gene features. The function should return a list of SeqRecord objects of the corresponding sequences.

Change the description field to have the corresponding gene name, followed by | and the sequence length. For example: rrn22 | 1231.

In [11]:

def extract_rRNA(gb_record):
    rRNAs = []
    for feat in gb_record.features:
        if feat.type == 'rRNA':
            name = feat.qualifiers['gene'][0]
            location = feat.location
            start = location.start
            end = location.end
            rRNA = gb_record[start:end]
            rRNA.description = name + ' | ' + str(len(rRNA))
            rRNAs.append(rRNA)
    return rRNAs
         
maize_chl_rRNAs = extract_rRNA(maize_chl)
print([x.description for x in maize_chl_rRNAs])

['rrn16 | 1492', 'rrn23 | 2888', 'rrn4.5 | 95', 'rrn5 | 121', 'rrn5 | 121', 'rrn4.5 | 95', 'rrn23 | 2888', 'rrn16 | 1492']

c) Print the rRNA sequences to the output file maize_chloroplast_rRNAs.fasta in fasta format.

In [12]:

out_file = "maize_chloroplast_rRNAs.fasta"
SeqIO.write(maize_chl_rRNAs,out_file,'fasta')

Out[12]: