A Gene Ontology Tutorial in Python - Model Solutions to Exercises

These electronic exercises (with solutions) accompany the book chapter "A Gene Ontology Tutorial in Python" by Alex Warwick Vesztrocy and Christophe Dessimoz, to appear in The Gene Ontology Handbook, C Dessimoz and N Skunca Eds, Springer Humana.

Version: 1.0.2 (Feb 2019): Updated QuickGO API calls and usage of GOATOOLS to version 0.8.12

Section 2 - Querying the Gene Ontology

First, we need to load the GOATools library. This enables us to parse the Gene Ontology (GO) OBO file. For more information on GOATools, see their documentation.

In [ ]:
# Import the OBO parser from GOATools
from goatools import obo_parser

In order to download the GO OBO file, we also require the wget and os libraries.

In [ ]:
import wget
import os

Now, we can download the OBO file into the './data' folder using the following. We are going to download the go-basic.obo version of the ontology, which is guaranteed to be acyclic, which means that annotations can be propagated up the graph.

In [ ]:
go_obo_url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
data_folder = os.getcwd() + '/data'

# Check if we have the ./data directory already
if(not os.path.isfile(data_folder)):
    # Emulate mkdir -p (no error if folder exists)
    except OSError as e:
        if(e.errno != 17):
            raise e
    raise Exception('Data path (' + data_folder + ') exists as a file. '
                   'Please rename, remove or change the desired location of the data path.')

# Check if the file exists already
if(not os.path.isfile(data_folder+'/go-basic.obo')):
    go_obo = wget.download(go_obo_url, data_folder+'/go-basic.obo')
    go_obo = data_folder+'/go-basic.obo'

The path to the GO OBO file is now stored in the variable go_obo.

In [ ]:

Now we can create a dictionary of the GO terms, using the obo_parser from GOATools.

In [ ]:
go = obo_parser.GODag(go_obo)

 Exercises 2.1

Question 2.1.a.

What is the name of the GO term GO:00048527?

Answer 2.1.a.

Question 2.1.b.

What are the immediate parent(s) of the term GO:00048527?

Answer 2.1.b.

Question 2.1.c.

What are the immediate children of the term GO:00048527?

Answer 2.1.c.

Question 2.1.d.

Recursively find all the parent and child terms of the term GO:00048527. Hint: use your solutions to the previous two questions, with a recursive loop.

Answer 2.1.d.

Question 2.1.e.

How many GO terms have the word “growth” in their name?

Answer 2.1.e.

Question 2.1.f.

What is the deepest common ancestor term of GO:0048527 and GO:0097178?

Answer 2.1.f.

 Exercises 2.2

Using the visualisation function in GOATools, answer the following questions.

Question 2.2.a.

Produce a figure similar to that in Figure 1 of the chapter, for the GO term GO:0097190. From the visualisation, what is the name of this term?

Answer 2.2.a

Question 2.2.b.

Using this figure, what is the most specific term that is in the parent terms of both GO:0097191 and GO:0038034? This is also referred to as the lowest common ancestor.

Answer 2.2.b.

 Exercises 2.3

Using the get_term() function, listed below, answer the following questions.

Note: the get_oboxml() function listed in the chapter, in Source Code 2.1, will no longer work. This is due to an API overhaul of the EMBL-EBI's QuickGO browser.

For the interested reader, it is also possible to use the bioservices library, in order to retrieve information from QuickGO (as well as many other web services).

In [ ]:
from future.standard_library import install_aliases
from urllib.request import urlopen
import json

def get_term(go_id):
        This function retrieves the definition of a given Gene Ontology term,
        using EMBL-EBI's QuickGO browser.
        Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
    quickgo_url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/" + go_id
    ret = urlopen(quickgo_url)
    # Check the response
    if(ret.getcode() == 200):
        term = json.loads(ret.read())
        return term['results'][0]
        raise ValueError("Couldn't receive information from QuickGO. Check GO ID and try again.")

Question 2.3.a.

Find the name and description of the GO term GO:0048527. Hint: print out the dictionary returned by the function and study its structure.

Answer 2.3.a.

 Question 2.3.b.

Look at the difference in the OBO-XML output for the GO terms GO:00048527 and GO:0097178, then generate a table of the synonymous relationships of the term GO:0097178.

Answer 2.3.b.

Section 3 - Retrieving GO annotations

In this section we will look at how to manipulate the Gene Association File (GAF) standard, using a parser from the BioPython package.

In [ ]:
import Bio.UniProt.GOA as GOA

First we need to download a GAF file from the EBI FTP website, which hosts the current and all previous UniProt-GOA annotations. The links to these can be found on the EBI GOA Downloads page.

As an example, we are going to download the reduced GAF file containing gene association data for Arabidopsis Thaliana.

In [ ]:
import os
from ftplib import FTP
arab_uri = '/pub/databases/GO/goa/ARABIDOPSIS/goa_arabidopsis.gaf.gz'
arab_fn = arab_uri.split('/')[-1]

# Check if the file exists already
arab_gaf = os.path.join(data_folder, arab_fn)
if(not os.path.isfile(arab_gaf)):
    # Login to FTP server
    ebi_ftp = FTP('ftp.ebi.ac.uk')
    ebi_ftp.login() # Logs in anonymously
    # Download
    with open(arab_gaf,'wb') as arab_fp:
        ebi_ftp.retrbinary('RETR {}'.format(arab_uri), arab_fp.write)
    # Logout from FTP server

Now we can load all the annotations into a dictionary, using the iterator from the BioPython package (Bio.UniProt.GOA.gafiterator).

In [ ]:
import gzip

# File is a gunzip file, so we need to open it in this way
with gzip.open(arab_gaf, 'rt') as arab_gaf_fp:
    arab_funcs = {}  # Initialise the dictionary of functions
    # Iterate on each function using Bio.UniProt.GOA library.
    for entry in GOA.gafiterator(arab_gaf_fp):
        uniprot_id = entry.pop('DB_Object_ID')
        arab_funcs[uniprot_id] = entry

Now we have a structure of the annotations which can manipulated. Each of the entries have been loaded in the following form.

In [ ]:

Exercises 3.1

Question 3.1.a.

Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?

Answer 3.1.a.

Question 3.1.b.

How many genes (of Arabidopsis thaliana) have the annotation GO:0048527?

Answer 3.1.b.

Question 3.1.c.

Generate a list of annotated proteins which have the word “growth” in their name.

Answer 3.1.c.

Question 3.1.d.

There are 21 evidence codes used in the Gene Ontology project. As discussed in Chapter 3, many of these are inferred, either by curators or automatically. Find the counts of each evidence code for in the Arabidopsis thaliana annotation file.

Answer 3.1.d.

 Question 3.1.f (Extension Question).

To help visualise the counts of each evidence code found in the previous question, produce a pie chart showing the proportion of annotations with each evidence code for the Arabidopsis thaliana annotations dataset.

Answer 3.1.f.

Section 4 - GO enrichment or depletion analysis

In this section, the GOEnrichmentStudy() function from the GOATools library will be used to perform GO enrichment analysis.

In [ ]:
from goatools.go_enrichment import GOEnrichmentStudy

Exercises 4.1

Perform an enrichment analysis using the list of genes with the "growth" keyword from exercise 3.1.c, and the GO structure from exercise 2.1.

The population is the functions observed in the Arabidopsis thaliana GOA file.

In [ ]:
pop = arab_funcs.keys()

Then, we need to create a dictionary of genes with their UniProt ID as a key and their set of GO annotations as the values.

In [ ]:
assoc = {}

for x in arab_funcs:
    if x not in assoc:
        assoc[x] = set()

Now, the study set here is those genes with the "growth" keyword, found previously.

In [ ]:
study = growth_dict.keys()

Question 4.1.a.

Which GO term is most significantly enriched or depleted? Does this make sense?

Answer 4.1.a.

Question 4.1.b.

How many terms are enriched, when using the Bonferroni corrected method and a p-value $\le$ 0.01?

Answer 4.1.b.

Question 4.1.c.

How many terms are enriched with false discovery rate (a.k.a. q-value) $\le$ 0.01?

Answer 4.1.c.

Section 5 - Computing basic semantic similarities between GO terms

In this section we look at how to compute semantic similarity between GO terms.

In [ ]:
from collections import Counter
class TermCounts():
        TermCounts counts the term counts for each 
    def __init__(self, go, annots):
            Initialise the counts and  
        # Backup
        self._go = go
        # Initialise the counters
        self._counts = Counter()
        self._aspect_counts = Counter()
        # Fill the counters...
        self._count_terms(go, annots)
    def _count_terms(self, go, annots):
            Fills in the counts and overall aspect counts.
        for x in annots:
            # Extract term information
            go_id = annots[x]['GO_ID']
            namespace = go[go_id].namespace

            self._counts[go_id] += 1
            rec = go[go_id]
            parents = rec.get_all_parents()
            for p in parents:
                self._counts[p] += 1
            self._aspect_counts[namespace] += 1
    def get_count(self, go_id):
            Returns the count of that GO term observed in the annotations.
        return self._counts[go_id]
    def get_total_count(self, aspect):
            Gets the total count that's been precomputed.
        return self._aspect_counts[aspect]
    def get_term_freq(self, go_id):
            Returns the frequency at which a particular GO term has 
            been observed in the annotations.
            namespace = self._go[go_id].namespace
            freq = float(self.get_count(go_id)) / float(self.get_total_count(namespace))
        except ZeroDivisionError:
            freq = 0
        return freq

Question 5.1.a.

GO:0048364 (root development) and GO:0044707 (single-multicellular organism process) are two GO terms taken from Figure 1. Calculate the semantic similarity between them based on the inverse of the semantic distance (number of branches separating them).

 Answer 5.1.a.

Question 5.1.b.

Calculate the the information content (IC) of the GO term GO:0048364 (root development), based on the frequency of observation in Arabidopsis thaliana.

Answer 5.1.b.

Question 5.1.c.

Calculate the Resnik similarity measure between the same two terms as in 5.1.a.

Answer 5.1.c.