In this exercise, we will process and analyze the human mitochondrial genome using Biopython.
All the information that will be used is stored in GenBank record NC_012920. It is recommended that you inspect it before starting to work.
Write a function that receives a GenBank id (such as NC_012920) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the human mitochondrial genome record. Assume the default settings, as shown in class. Ignore any warning messages from NCBI that might be displayed.
from Bio import Entrez
from Bio import SeqIO
def fetch_gb_by_id(rec_id):
return gb_record
mito = fetch_gb_by_id('NC_012920')
assert mito.description == 'Homo sapiens mitochondrion, complete genome.'
The record fetched in section 1 is the complete sequence of the mitochondrial genome. We will now extract the genes from this sequence.
A sequence record includes many features, such as CDS, STS, genes and other. To access the features of the mito
SeqRecord, simply use mito.features
.
Let's explore the features field of the record and see some examples by running the following code:
mito_features = mito.features
print(type(mito_features))
print(type(mito_features[0]))
print('************')
print(mito_features[0])
print('************')
print(mito_features[1])
print('************')
print(mito_features[2])
You can learn more about SeqRecord features by visiting the dedicated documentation page.
We can see that there are different types of features, and that each one contains several fields of information. We will only need the type
, the location
and the qualifiers
fields, which We can access like this:
gene = mito.features[2]
print(gene.type)
print(gene.location)
print(gene.qualifiers)
print(type(gene.qualifiers))
a) Write a function that receives a SeqRecord and extract its gene features. The function should return a list of dictionaries, where each dictionary represents one gene. Gene dictionaries will have three keys: start
(start position), end
(end position) and name
(gene name). Complete the code below and use it on the mito
SeqRecord.
def extract_genes(gb_record):
genes = []
for feat in gb_record.features:
# if feature is gene
if ______________________:
gene_dict = {}
try:
# get start and end positions
location = ________________
start = __________________
end = ___________________
gene_name = _________________
# insert into dictionary
except:
print('Failed to obtain stats for feature')
if gene_dict != {}:
genes.append(gene_dict)
return genes
mito_genes = extract_genes(mito)
assert len(mito_genes) == 37
assert type(mito_genes[0]) == dict
b) We will now use the result of section 2a to extract the genes from the complete sequence.
Write a function that receives a SeqRecord and a list of genes (as created in section 2a) and returns a list of Seqrecords, each corresponding to a single gene. The description of the output SeqRecords should be the gene name, and the id should be an index running from 1 to the number of genes in the list. Run the function on the mitochondrial data. Complete the code.
def genes_records(gb_record,gene_list):
gene_records = []
for gene in gene_list:
start = ______________
end = ______________
name = ______________
rec = gb_record[start:end]
rec.description = ______________
rec.id = ______________
gene_records.append(rec)
return ______________
mito_gene_records = genes_records(mito,mito_genes)
assert len(mito_gene_records) == 37
Print the gene records to an output file of your choice, in fasta format.
out_file = ""
# print to output file