Notebook

Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015 ¶

Homework 5¶

Analyzing the human mitochondrial genome¶

In this exercise, we will process and analyze the human mitochondrial genome using Biopython.
All the information that will be used is stored in GenBank record NC_012920. It is recommended that you inspect it before starting to work.

1) Fetching the GenBank record¶

Write a function that receives a GenBank id (such as NC_012920) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the human mitochondrial genome record. Assume the default settings, as shown in class. Ignore any warning messages from NCBI that might be displayed.

In [ ]:

from Bio import Entrez
from Bio import SeqIO

def fetch_gb_by_id(rec_id):

    return gb_record

mito = fetch_gb_by_id('NC_012920')
assert mito.description == 'Homo sapiens mitochondrion, complete genome.'

2) Extracting the genes¶

The record fetched in section 1 is the complete sequence of the mitochondrial genome. We will now extract the genes from this sequence.
A sequence record includes many features, such as CDS, STS, genes and other. To access the features of the mito SeqRecord, simply use mito.features.
Let's explore the features field of the record and see some examples by running the following code:

In [ ]:

mito_features = mito.features
print(type(mito_features))
print(type(mito_features[0]))
print('************')
print(mito_features[0])
print('************')
print(mito_features[1])
print('************')
print(mito_features[2])

You can learn more about SeqRecord features by visiting the dedicated documentation page.
We can see that there are different types of features, and that each one contains several fields of information. We will only need the type, the location and the qualifiers fields, which We can access like this:

In [ ]:

gene = mito.features[2]
print(gene.type)
print(gene.location)
print(gene.qualifiers)
print(type(gene.qualifiers))

a) Write a function that receives a SeqRecord and extract its gene features. The function should return a list of dictionaries, where each dictionary represents one gene. Gene dictionaries will have three keys: start (start position), end (end position) and name (gene name). Complete the code below and use it on the mito SeqRecord.

In [ ]:

def extract_genes(gb_record):
    genes = []
    for feat in gb_record.features:
        # if feature is gene
        if ______________________:
            gene_dict = {}
            try:
                # get start and end positions
                location = ________________
                start = __________________
                end = ___________________
                gene_name = _________________
                
                # insert into dictionary
                
                
                

            except:
                print('Failed to obtain stats for feature')
            if gene_dict != {}:
                genes.append(gene_dict)
    return genes

mito_genes = extract_genes(mito)
assert len(mito_genes) == 37
assert type(mito_genes[0]) == dict

b) We will now use the result of section 2a to extract the genes from the complete sequence.
Write a function that receives a SeqRecord and a list of genes (as created in section 2a) and returns a list of Seqrecords, each corresponding to a single gene. The description of the output SeqRecords should be the gene name, and the id should be an index running from 1 to the number of genes in the list. Run the function on the mitochondrial data. Complete the code.

In [ ]:

def genes_records(gb_record,gene_list):
    gene_records = []
    for gene in gene_list:
        start = ______________
        end = ______________
        name = ______________
        rec = gb_record[start:end]
        rec.description = ______________
        rec.id = ______________
        gene_records.append(rec)
    return ______________

mito_gene_records = genes_records(mito,mito_genes)
assert len(mito_gene_records) == 37

3) Printing the result¶

Print the gene records to an output file of your choice, in fasta format.

In [ ]:

out_file = ""
# print to output file

Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015¶

Homework 5¶

Analyzing the human mitochondrial genome¶

1) Fetching the GenBank record¶

2) Extracting the genes¶

3) Printing the result¶

Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015 ¶