Notebook

1) Distance between sequences¶

A common task in sequence analysis is to calculate the distance between two DNA sequences.
There are many ways to define this distance, but the most simple one is called the Hamming distance. This distance is defined as the number of differences between two sequences of the same length. For example, the Hamming distance between AGGTCT and AGCTAT is 2. The distance between two identical sequences is 0.

a) Write a function that receives two strings, representing DNA sequences, and returns the Hamming distance between them as an integer. Use an assertion to verify that the sequences are of the same length.

In [3]:

def hamming_distance(seq1, seq2):
    """
    Calculates the Hamming distance of two DNA sequences, given as strings.
    Returns score as an integer.
    Input sequences must have the same length!
    """
    assert len(seq1) == len(seq2), "Input sequences must have the same length!"
    hamming_dis = 0
    for i in range(len(seq1)):
        if seq1[i] != seq2[i]:
            hamming_dis += 1
    return hamming_dis   
    
    
assert hamming_distance('AGGTCT', 'AGGTCT') == 0
assert hamming_distance('AGGTCT', 'AGCTAT') == 2

A more complex way of evaluating distance between sequences is using a Cost-matrix. Such a matrix describes the cost of each difference between the sequences. For example, a change from A to G may have a cost of 1, while a change from A to T may have a cost of 3. This method is sometimes called the Sankoff distance. The file cost_matrix.txt contains such a matrix, in a tab-delimited format. Using this matrix, the Sankoff distance between AGGTCT and AGCTAT is 6.

b) Write a function that receives a path to a cost matrix file (using the format described above), parses it and stores the information in a data structure of your choice. The function will return this data structure. In this section, you are not allowed to import any modules!.

In [11]:

def read_cost_matrix(filename):
    """
    Reads a cost matrix file. Returns a dictionary, where the keys are nucleotides (A,G,C,T)
    and the valuess are dictionaries. Each inner dictionary's keys are the nucleotides, and the
    valuess are the costs. To query this data structure, use: cost_matrix['N1']['N2'] 
    """
    cost_matrix = {}
    with open(filename,'r') as f:
        heads = f.readline().strip()   # get the first line
        nucleotides = heads.split('\t')   # get nucleotides in first line as list
        for line in f:
            line = line.strip()
            split_line = line.split('\t')
            from_nuc = split_line[0]   # nucleotide in left column  
            costs = split_line[1:]   # costs of change, as list
            costs_dict = {}   # inner dictionary
            for i in range(4):
                to_nuc = nucleotides[i]   # nucleotide from heads
                costs_dict[to_nuc] = int(costs[i])   # insert cost to inner dict
            cost_matrix[from_nuc] = costs_dict   # insert inner dict as value
    return cost_matrix   
    
    
mat = read_cost_matrix('cost_matrix.txt')
assert mat != None

c) Write a function that receives two strings, representing DNA sequences, and a cost matrix (formated as defined by the function in section b), and returns the Sankoff distance between the sequences based on the given matrix as an integer. Use an assertion to verify that the sequences are of the same length.

In [12]:

def sankoff_distance(seq1, seq2, cost_mat):
    """
    Calculates the Sankoff distance of two DNA sequences, given as strings,
    based on a given cost matrix.
    Returns score as an integer.
    Input sequences must have the same length!
    """
    assert len(seq1) == len(seq2), "Input sequences must have the same length!"
    sank_dis = 0
    for i in range(len(seq1)):
        sank_dis += cost_mat[seq1[i]][seq2[i]]
    return sank_dis  
    
    
    
assert sankoff_distance('AGGTCT', 'AGGTCT',mat) == 0
assert sankoff_distance('AGGTCT', 'AGCTAT',mat) == 6

2) 16S Kink-turn¶

The Kink-turn (usually called K-turn) is a common structural motiff found in many bacterial 16S rRNA sequences. It introduces a very tight kink into the axis of helical RNA. The motiff occurs in sequences of the form CGRNNGANC (where R is A or G and N is any nucleotide).

a) Write a function that receives a list of GenBank accession IDs (as strings) and returns a list of SeqRecord objects, fetched from GenBank according to these accessions, like we did in lecture 6.

In [15]:

from Bio import Entrez, SeqIO
Entrez.email = 'A.N.Other@example.com'

def fetch_gb_records(gb_acc_list):
    """
    Receives a list of GB accessions as strings.
    Returns a list of the corresponding SeqRecords.
    """
    gb_records = []
    for acc in gb_acc_list:
        try:
            entrez_obj = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=acc)   # fetch record from GB
            record = SeqIO.read(entrez_obj, "gb")   # read record as SeqRecord
            gb_records.append(record)   # insert to records list
        except:
            print('Failed to fetch record for id',acc)
    return(gb_records)
    

bacteria_16s_accessions = ['EU014689','AJ578036','AF201899','NR_028978','EU118114','AM158979','AY773947','AJ697941','X81660','X83947']
bacteria_16s_records = fetch_gb_records(bacteria_16s_accessions)
assert len(bacteria_16s_records) == len(bacteria_16s_accessions)

b) Write a function that receives a list of SeqRecords and checks for each sequence if it contains a certain motiff, given as a pre-compiled regular expression. If it does, store the exact sequence of the motiff (only the motiff, not the whole sequence!) in a dict. The function returns a dict where the keys are the organism names and the values are the motiff sequences, as strings. If a sequence does not include the motiff, do not add it to the dict.
Complete the regex to scan the 16S sequences for K-turn motiffs.

In [23]:

import re

def find_motiff_in_records(rec_list,motiff_regex):
    """
    Receives a list of SeqRecords and searches them for a motiff, given as a regex.
    Returns a dictionary where the keys are the organism names and the values are the matched motiffs.
    """
    motiffs_dict = {}
    for rec in rec_list:
        search_result = re.search(motiff_regex,str(rec.seq))
        if search_result != None:
            organism = rec.annotations['organism']
            exact_motiff = search_result.group()
            motiffs_dict[organism] = exact_motiff  

    return motiffs_dict

kink_turn_regex = re.compile(r'CG[AG][AGCT]{2}GA[AGCT]C')
kink_turn_motiffs_dict = find_motiff_in_records(bacteria_16s_records,kink_turn_regex)

3) Relation between litter size and birth weight¶

In this question we will look for a relation between litter or clutch size (number of offspring per reproductive cycle) and the birth weight (the weight of the offpring) in the animal kingdom.

For this analysis we will load the AnAge dataset that we used in lecture 7.

First, import the neccesary libraries:

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy
import pandas as pd
import seaborn as sns
import urllib
import zipfile

a) Get the zip file containing the data, extract it and read the data to a DataFrame. We are interested in the Litter/Clutch size and Birth weight (g) columns.

In [11]:

urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')

Out[11]:

('anage_dataset.zip', <http.client.HTTPMessage at 0xb189d50>)

In [12]:

with zipfile.ZipFile('anage_dataset.zip') as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)

b) If you examined the data you might have noticed that some rows have a NaN value in our columns of interest. We need to remove these rows from the data. You can use np.isnan, np.isfinite or any other method you'd like.

In [13]:

data = data[np.isfinite(data['Litter/Clutch size'])]
data = data[np.isfinite(data['Birth weight (g)'])]

c) Plot a scatter plot of the data to exmaine it. Use the litter size on the x-axis. Don't forget the axis labels.

In [8]:

data.plot(x='Litter/Clutch size', y='Birth weight (g)', kind="scatter");

d) We are looking for a possible linear relationship between the variables. Apply a log transformation on the data (both columns) and plot a new scatter plot of the transformed data (don't forget the axis labels should change to reflect the transformation!).

In [10]:

data['Log Litter/Clutch size'] = np.log(data['Litter/Clutch size'])
data['Log Birth weight (g)'] = np.log(data['Birth weight (g)'])

data.plot(x='Log Litter/Clutch size', y='Log Birth weight (g)', kind="scatter");

e) Perform linear regression on the transformed data and print the intercept and slope of the regression.

In [30]:

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x=data['Log Litter/Clutch size'], y=data['Log Birth weight (g)'])
print("intercept: %.3f, slope: %.3f" % (intercept, slope))

intercept: 5.718, slope: -2.434

f) Plot a scatterplot of data together with a line for the linear regression.

In [23]:

data.plot(x='Log Litter/Clutch size', y='Log Birth weight (g)', kind="scatter")
x = np.linspace(0,5)
plt.plot(x, [intercept + slope*_x_ for _x_ in x], 'k-')

Out[23]:

[<matplotlib.lines.Line2D at 0x7c0a8d0>]

In [24]:

sns.lmplot(x='Log Litter/Clutch size', y='Log Birth weight (g)', data=data)
plt.xlim(-1,6)
plt.ylim(-5,15)

Out[24]:

(-5, 15)

g) predict the birth weight of offspring in a litter with 10 offspring (don't forget the transformation!):

In [35]:

print("In a litter with 10 offspring, the birth weight will be", np.exp(intercept + slope * np.log(10.)), "grams")

In a litter with 10 offspring, the birth weight will be 1.11964209943 grams