In this question we will calculate the hydrphobicity of a protein based on its amino-acid (aa) sequence.
In chemistry, hydrophobicity is the physical property of a molecule (known as a hydrophobe) that is seemingly repelled from a mass of water.
%matplotlib inline
import matplotlib.pyplot as plt
ges_scale
is a dict
that contains the hydrphobicity score of every aa. The keys are the letters that represent the aa, the values are the scores. For example, the letter for Leucine is L
and its hydrphobicity score is -2.8.
ges_scale = {'F':-3.7,'M':-3.4,'I':-3.1,'L':-2.8,'V':-2.6,
'C':-2.0,'W':-1.9,'A':-1.6,'T':-1.2,'G':-1.0,
'S':-0.6,'P': 0.2,'Y': 0.7,'H': 3.0,'Q': 4.1,
'N': 4.8,'E': 8.2,'K': 8.8,'D': 9.2,'R':12.3}
a) Write a function called hydrphobicity
that calculates the hydrphobicity of a sequence. The function calculates the average hydrphobicity around every position in the sequence. The average is calculated over a window - a set number of positions. This method is knows as a sliding window as after each calculation the window slides to the next position.
Input: seq
is a string, each character is an aa letter. win_size
is the windows size to work with (number of positions on which to average).
Output: list
of float
s of the hydrphobicity scores, calculated for each position of the window.
def hydrophobicity(seq, win_size=15):
"""Scan a protein sequence for hydrophobic regions using the GES
hydrophobicity scale.
"""
score = None
score_list = []
for i in range(len(seq)- win_size+1):
j = i + win_size
if score is None:
score = 0
for k in range(i,j):
score += ges_scale[seq[k]]
else:
score += ges_scale[seq[j - 1]]
score -= ges_scale[seq[i - 1]]
score_list.append(score/win_size)
return score_list
protein_seq = 'IRTNGTHMQPLLKLMKFQKFLLELFTLQKRKPEKGYNLPIISLNQ'
scores = hydrophobicity(protein_seq)
print(scores)
[0.7666666666666665, 1.5599999999999998, 0.49333333333333323, 0.8466666666666665, 1.1133333333333333, 0.9333333333333333, 0.8266666666666665, 0.43999999999999984, 1.2133333333333332, 0.7533333333333331, 0.493333333333333, 0.5999999999999996, 0.5999999999999996, 0.28666666666666624, 1.0599999999999996, 2.1066666666666665, 2.106666666666666, 2.3666666666666663, 2.6399999999999992, 2.6399999999999997, 2.82, 3.0533333333333332, 3.5599999999999996, 2.826666666666666, 3.026666666666666, 3.066666666666666, 2.9399999999999995, 3.086666666666666, 2.626666666666666, 2.3599999999999994, 1.8133333333333328]
b) Next, plot the hydrophobicity scores. Don't forget the axis labels.
plt.plot(scores)
plt.xlabel('Protein position')
plt.ylabel('Hydrophobicity');
In this question we will translate RNA sequences to amino-acid (aa) sequences.
genetic_code
is a dict
in which keys are RNA codons - triplets of base names (A, G, C, T) and the values are short names of aa (Leu
for Leucine, etc.).
genetic_code = {
'UUU':'Phe', 'UUC':'Phe', 'UCU':'Ser', 'UCC':'Ser',
'UAU':'Tyr', 'UAC':'Tyr', 'UGU':'Cys', 'UGC':'Cys',
'UUA':'Leu', 'UCA':'Ser', 'UAA':None, 'UGA':None,
'UUG':'Leu', 'UCG':'Ser', 'UAG':None, 'UGG':'Trp',
'CUU':'Leu', 'CUC':'Leu', 'CCU':'Pro', 'CCC':'Pro',
'CAU':'His', 'CAC':'His', 'CGU':'Arg', 'CGC':'Arg',
'CUA':'Leu', 'CUG':'Leu', 'CCA':'Pro', 'CCG':'Pro',
'CAA':'Gln', 'CAG':'Gln', 'CGA':'Arg', 'CGG':'Arg',
'AUU':'Ile', 'AUC':'Ile', 'ACU':'Thr', 'ACC':'Thr',
'AAU':'Asn', 'AAC':'Asn', 'AGU':'Ser', 'AGC':'Ser',
'AUA':'Ile', 'ACA':'Thr', 'AAA':'Lys', 'AGA':'Arg',
'AUG':'Met', 'ACG':'Thr', 'AAG':'Lys', 'AGG':'Arg',
'GUU':'Val', 'GUC':'Val', 'GCU':'Ala', 'GCC':'Ala',
'GAU':'Asp', 'GAC':'Asp', 'GGU':'Gly', 'GGC':'Gly',
'GUA':'Val', 'GUG':'Val', 'GCA':'Ala', 'GCG':'Ala',
'GAA':'Glu', 'GAG':'Glu', 'GGA':'Gly', 'GGG':'Gly'}
a) Write a function called protein_translation
that translates an RNA sequence to an aa sequence.
Input: seq
is a string of A, G, C and U (all uppercase).
Output: list
of strings of short names of aa.
def protein_translation(seq):
""" This function translates a nucleic acid sequence into a
protein sequence, until the end or until it comes across
a stop codon
"""
seq = seq.replace('T','U') # Make sure we have RNA sequence
proteinSeq = []
i = 0
while i+2 < len(seq):
codon = seq[i:i+3]
aminoAcid = genetic_code[codon]
if aminoAcid is None: # Found stop codon
break
proteinSeq.append(aminoAcid)
i += 3
return proteinSeq
dna_seq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
aa_seq = protein_translation(dna_seq)
print(aa_seq)
['Met', 'Val', 'His', 'Leu', 'Thr', 'Pro', 'Glu', 'Glu', 'Lys', 'Ser', 'Ala', 'Val', 'Thr', 'Ala', 'Leu', 'Trp', 'Gly', 'Lys', 'Val']
a) Write a function called fetch_gb_by_id
that receives a GenBank ID (such as KF241981
) and returns a Biopython SeqRecord
object of the corresponding result. Use it to fetch the maize chloroplast genome record.
Assume the default settings, as shown in lecture 6. Ignore any warning messages from NCBI that might be displayed.
from Bio import Entrez
from Bio import SeqIO
def fetch_gb_by_id(rec_id):
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=rec_id)
gb_record = SeqIO.read(handle, "gb")
handle.close()
return gb_record
maize_chl = fetch_gb_by_id('KF241981')
print(maize_chl.description)
assert len(maize_chl.features) == 263
Zea mays subsp. mays cultivar B73 chloroplast, complete genome.
b) Write a function called extract_rRNA
that receives a SeqRecord
object and extract its gene features.
The function should return a list
of SeqRecord
objects of the corresponding sequences.
Change the description
field to have the corresponding gene name, followed by |
and the sequence length. For example: rrn22 | 1231
.
def extract_rRNA(gb_record):
rRNAs = []
for feat in gb_record.features:
if feat.type == 'rRNA':
name = feat.qualifiers['gene'][0]
location = feat.location
start = location.start
end = location.end
rRNA = gb_record[start:end]
rRNA.description = name + ' | ' + str(len(rRNA))
rRNAs.append(rRNA)
return rRNAs
maize_chl_rRNAs = extract_rRNA(maize_chl)
print([x.description for x in maize_chl_rRNAs])
['rrn16 | 1492', 'rrn23 | 2888', 'rrn4.5 | 95', 'rrn5 | 121', 'rrn5 | 121', 'rrn4.5 | 95', 'rrn23 | 2888', 'rrn16 | 1492']
c) Print the rRNA sequences to the output file maize_chloroplast_rRNAs.fasta
in fasta format.
out_file = "maize_chloroplast_rRNAs.fasta"
SeqIO.write(maize_chl_rRNAs,out_file,'fasta')
8