In this exercise, we will process and analyze the human mitochondrial genome using Biopython.
All the information that will be used is stored in GenBank record NC_012920. It is recommended that you inspect it before starting to work.
Write a function that receives a GenBank id (such as NC_012920) and returns a Biopython SeqRecord object of the corresponding result. Use it to fetch the human mitochondrial genome record. Assume the default settings, as shown in class. Ignore any warning messages from NCBI that might be displayed.
from Bio import Entrez
from Bio import SeqIO
def fetch_gb_by_id(rec_id):
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=rec_id)
gb_record = SeqIO.read(handle, "gb")
handle.close()
return gb_record
mito = fetch_gb_by_id('NC_012920')
assert mito.description == 'Homo sapiens mitochondrion, complete genome.'
C:\pyzo2015a\lib\site-packages\Bio\Entrez\__init__.py:455: UserWarning: Email address is not specified. To make use of NCBI's E-utilities, NCBI requires you to specify your email address with each request. As an example, if your email address is A.N.Other@example.com, you can specify it as follows: from Bio import Entrez Entrez.email = 'A.N.Other@example.com' In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access to the E-utilities. E-utilities.""", UserWarning)
The record fetched in section 1 is the complete sequence of the mitochondrial genome. We will now extract the genes from this sequence.
A sequence record includes many features, such as CDS, STS, genes and other. To access the features of the mito
SeqRecord, simply use mito.features
.
Let's explore the features field of the record and see some examples by running the following code:
mito_features = mito.features
print(type(mito_features))
print(type(mito_features[0]))
print('************')
print(mito_features[0])
print('************')
print(mito_features[1])
print('************')
print(mito_features[2])
<class 'list'> <class 'Bio.SeqFeature.SeqFeature'> ************ type: source location: [0:16569](+) qualifiers: Key: country, Value: ['United Kingdom: Great Britain'] Key: db_xref, Value: ['taxon:9606'] Key: isolation_source, Value: ['caucasian'] Key: mol_type, Value: ['genomic DNA'] Key: note, Value: ['this is the rCRS'] Key: organelle, Value: ['mitochondrion'] Key: organism, Value: ['Homo sapiens'] Key: tissue_type, Value: ['placenta'] ************ type: D-loop location: join{[0:576](-), [16023:16569](-)} qualifiers: Sub-Features type: D-loop location: [16023:16569](-) qualifiers: type: D-loop location: [0:576](-) qualifiers: ************ type: gene location: [576:647](+) qualifiers: Key: db_xref, Value: ['GeneID:4558', 'HGNC:HGNC:7481', 'MIM:590070'] Key: gene, Value: ['TRNF'] Key: nomenclature, Value: ['Official Symbol: MT-TF | Name: mitochondrially encoded tRNA phenylalanine | Provided by: HGNC:HGNC:7481']
You can learn more about SeqRecord features by visiting the dedicated documentation page.
We can see that there are different types of features, and that each one contains several fields of information. We will only need the type
, the location
and the qualifiers
fields, which We can access like this:
gene = mito.features[2]
print(gene.type)
print(gene.location)
print(gene.qualifiers)
print(type(gene.qualifiers))
gene [576:647](+) {'gene': ['TRNF'], 'nomenclature': ['Official Symbol: MT-TF | Name: mitochondrially encoded tRNA phenylalanine | Provided by: HGNC:HGNC:7481'], 'db_xref': ['GeneID:4558', 'HGNC:HGNC:7481', 'MIM:590070']} <class 'dict'>
a) Write a function that receives a SeqRecord and extract its gene features. The function should return a list of dictionaries, where each dictionary represents one gene. Gene dictionaries will have three keys: start
(start position), end
(end position) and name
(gene name). Complete the code below and use it on the mito
SeqRecord.
def extract_genes(gb_record):
genes = []
for feat in gb_record.features:
# if feature is gene
if feat.type == 'gene':
gene_dict = {}
try:
# get start and end positions
location = feat.location
start = location.start
end = location.end
gene_name = feat.qualifiers['gene'][0]
gene_dict['start'] = start
gene_dict['end'] = end
gene_dict['name'] = gene_name
except:
print('Failed to obtain stats for feature')
print(feat.qualifiers.keys())
if gene_dict != {}:
genes.append(gene_dict)
return genes
mito_genes = extract_genes(mito)
assert len(mito_genes) == 37
assert type(mito_genes[0]) == dict
b) We will now use the result of section 2a to extract the genes from the complete sequence.
Write a function that receives a SeqRecord and a list of genes (as created in section 2a) and returns a list of Seqrecords, each corresponding to a single gene. The description of the output SeqRecords should be the gene name, and the id should be an index running from 1 to the number of genes in the list. Run the function on the mitochondrial data. Complete the code.
def genes_records(gb_record,gene_list):
gene_records = []
for gene in gene_list:
start = gene['start']
end = gene['end']
name = gene['name']
rec = gb_record[start:end]
rec.description = name
rec.id = str(len(gene_records)+1)
gene_records.append(rec)
return gene_records
mito_gene_records = genes_records(mito,mito_genes)
assert len(mito_gene_records) == 37
Print the gene records to an output file of your choice, in fasta format.
out_file = "mito_genes.fasta"
# print to output file
SeqIO.write(mito_gene_records,out_file,'fasta')
37