These electronic exercises (with solutions) accompany the book chapter "A Gene Ontology Tutorial in Python" by Alex Warwick Vesztrocy and Christophe Dessimoz, to appear in The Gene Ontology Handbook, C Dessimoz and N Skunca Eds, Springer Humana.
Version: 1.0.2 (Feb 2019): Updated QuickGO API calls and usage of GOATOOLS to version 0.8.12
First, we need to load the GOATools library. This enables us to parse the Gene Ontology (GO) OBO file. For more information on GOATools, see their documentation.
# Import the OBO parser from GOATools
from goatools import obo_parser
In order to download the GO OBO file, we also require the wget
and os
libraries.
import wget
import os
Now, we can download the OBO file into the './data'
folder using the following. We are going to download the go-basic.obo
version of the ontology, which is guaranteed to be acyclic, which means that annotations can be propagated up the graph.
go_obo_url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
data_folder = os.getcwd() + '/data'
# Check if we have the ./data directory already
if(not os.path.isfile(data_folder)):
# Emulate mkdir -p (no error if folder exists)
try:
os.mkdir(data_folder)
except OSError as e:
if(e.errno != 17):
raise e
else:
raise Exception('Data path (' + data_folder + ') exists as a file. '
'Please rename, remove or change the desired location of the data path.')
# Check if the file exists already
if(not os.path.isfile(data_folder+'/go-basic.obo')):
go_obo = wget.download(go_obo_url, data_folder+'/go-basic.obo')
else:
go_obo = data_folder+'/go-basic.obo'
The path to the GO OBO file is now stored in the variable go_obo
.
print(go_obo)
Now we can create a dictionary of the GO terms, using the obo_parser
from GOATools.
go = obo_parser.GODag(go_obo)
What are the immediate parent(s) of the term GO:00048527
?
What are the immediate children of the term GO:00048527
?
Recursively find all the parent and child terms of the term GO:00048527
. Hint: use your solutions to the previous two questions, with a recursive loop.
How many GO terms have the word “growth” in their name?
What is the deepest common ancestor term of GO:0048527
and GO:0097178
?
Using this figure, what is the most specific term that is in the parent terms of both GO:0097191
and GO:0038034
? This is also referred to as the lowest common ancestor.
Using the get_term()
function, listed below, answer the following questions.
Note: the get_oboxml()
function listed in the chapter, in Source Code 2.1, will no longer work. This is due to an API overhaul of the EMBL-EBI's QuickGO browser.
For the interested reader, it is also possible to use the bioservices
library, in order to retrieve information from QuickGO (as well as many other web services).
from future.standard_library import install_aliases
install_aliases()
from urllib.request import urlopen
import json
def get_term(go_id):
"""
This function retrieves the definition of a given Gene Ontology term,
using EMBL-EBI's QuickGO browser.
Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
"""
quickgo_url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/" + go_id
ret = urlopen(quickgo_url)
# Check the response
if(ret.getcode() == 200):
term = json.loads(ret.read())
return term['results'][0]
else:
raise ValueError("Couldn't receive information from QuickGO. Check GO ID and try again.")
Find the name and description of the GO term GO:0048527
. Hint: print out the dictionary returned by the function and study its structure.
Look at the difference in the OBO-XML output for the GO terms GO:00048527
and GO:0097178
, then generate a table of the synonymous relationships of the term GO:0097178
.
In this section we will look at how to manipulate the Gene Association File (GAF) standard, using a parser from the BioPython package.
import Bio.UniProt.GOA as GOA
First we need to download a GAF file from the EBI FTP website, which hosts the current and all previous UniProt-GOA annotations. The links to these can be found on the EBI GOA Downloads page.
As an example, we are going to download the reduced GAF file containing gene association data for Arabidopsis Thaliana.
import os
from ftplib import FTP
arab_uri = '/pub/databases/GO/goa/ARABIDOPSIS/goa_arabidopsis.gaf.gz'
arab_fn = arab_uri.split('/')[-1]
# Check if the file exists already
arab_gaf = os.path.join(data_folder, arab_fn)
if(not os.path.isfile(arab_gaf)):
# Login to FTP server
ebi_ftp = FTP('ftp.ebi.ac.uk')
ebi_ftp.login() # Logs in anonymously
# Download
with open(arab_gaf,'wb') as arab_fp:
ebi_ftp.retrbinary('RETR {}'.format(arab_uri), arab_fp.write)
# Logout from FTP server
ebi_ftp.quit()
Now we can load all the annotations into a dictionary, using the iterator from the BioPython package (Bio.UniProt.GOA.gafiterator
).
import gzip
# File is a gunzip file, so we need to open it in this way
with gzip.open(arab_gaf, 'rt') as arab_gaf_fp:
arab_funcs = {} # Initialise the dictionary of functions
# Iterate on each function using Bio.UniProt.GOA library.
for entry in GOA.gafiterator(arab_gaf_fp):
uniprot_id = entry.pop('DB_Object_ID')
arab_funcs[uniprot_id] = entry
Now we have a structure of the annotations which can manipulated. Each of the entries have been loaded in the following form.
print(arab_funcs[list(arab_funcs.keys())[0]])
Find the total number of annotations for Arabidopsis thaliana with NOT
qualifiers. What is this as a percentage of the total number of annotations for this species?
How many genes (of Arabidopsis thaliana) have the annotation GO:0048527
?
Generate a list of annotated proteins which have the word “growth” in their name.
There are 21 evidence codes used in the Gene Ontology project. As discussed in Chapter 3, many of these are inferred, either by curators or automatically. Find the counts of each evidence code for in the Arabidopsis thaliana annotation file.
To help visualise the counts of each evidence code found in the previous question, produce a pie chart showing the proportion of annotations with each evidence code for the Arabidopsis thaliana annotations dataset.
In this section, the GOEnrichmentStudy()
function from the GOATools library will be used to perform GO enrichment analysis.
from goatools.go_enrichment import GOEnrichmentStudy
Perform an enrichment analysis using the list of genes with the "growth" keyword from exercise 3.1.c, and the GO structure from exercise 2.1.
The population is the functions observed in the Arabidopsis thaliana GOA file.
pop = arab_funcs.keys()
Then, we need to create a dictionary of genes with their UniProt ID as a key and their set of GO annotations as the values.
assoc = {}
for x in arab_funcs:
if x not in assoc:
assoc[x] = set()
assoc[x].add(str(arab_funcs[x]['GO_ID']))
Now, the study set here is those genes with the "growth" keyword, found previously.
study = growth_dict.keys()
Which GO term is most significantly enriched or depleted? Does this make sense?
How many terms are enriched, when using the Bonferroni corrected method and a p-value $\le$ 0.01?
How many terms are enriched with false discovery rate (a.k.a. q-value) $\le$ 0.01?
In this section we look at how to compute semantic similarity between GO terms.
from collections import Counter
class TermCounts():
'''
TermCounts counts the term counts for each
'''
def __init__(self, go, annots):
'''
Initialise the counts and
'''
# Backup
self._go = go
# Initialise the counters
self._counts = Counter()
self._aspect_counts = Counter()
# Fill the counters...
self._count_terms(go, annots)
def _count_terms(self, go, annots):
'''
Fills in the counts and overall aspect counts.
'''
for x in annots:
# Extract term information
go_id = annots[x]['GO_ID']
namespace = go[go_id].namespace
self._counts[go_id] += 1
rec = go[go_id]
parents = rec.get_all_parents()
for p in parents:
self._counts[p] += 1
self._aspect_counts[namespace] += 1
def get_count(self, go_id):
'''
Returns the count of that GO term observed in the annotations.
'''
return self._counts[go_id]
def get_total_count(self, aspect):
'''
Gets the total count that's been precomputed.
'''
return self._aspect_counts[aspect]
def get_term_freq(self, go_id):
'''
Returns the frequency at which a particular GO term has
been observed in the annotations.
'''
try:
namespace = self._go[go_id].namespace
freq = float(self.get_count(go_id)) / float(self.get_total_count(namespace))
except ZeroDivisionError:
freq = 0
return freq
GO:0048364
(root development) and GO:0044707
(single-multicellular organism process) are two GO terms taken from Figure 1. Calculate the semantic similarity between them based on the inverse of the semantic distance (number of branches separating them).
Calculate the the information content (IC) of the GO term GO:0048364
(root development), based on the frequency of observation in Arabidopsis thaliana.
Calculate the Resnik similarity measure between the same two terms as in 5.1.a.