These electronic exercises (with solutions) accompany the book chapter "A Gene Ontology Tutorial in Python" by Alex Warwick Vesztrocy and Christophe Dessimoz, to appear in The Gene Ontology Handbook, C Dessimoz and N Skunca Eds, Springer Humana.
Version: 1.0.2 (Feb 2019): Updated QuickGO API calls and usage of GOATOOLS to version 0.8.12
First, we need to load the GOATools library. This enables us to parse the Gene Ontology (GO) OBO file. For more information on GOATools, see their documentation.
# Import the OBO parser from GOATools
from goatools import obo_parser
In order to download the GO OBO file, we also require the wget and os libraries.
import wget
import os
Now, we can download the OBO file into the './data' folder using the following. We are going to download the go-basic.obo version of the ontology, which is guaranteed to be acyclic, so that annotations can be propagated up the graph.
go_obo_url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
data_folder = os.getcwd() + '/data'
# Check if we have the ./data directory already
if(not os.path.isfile(data_folder)):
    # Emulate mkdir -p (no error if folder exists)
    try:
        os.mkdir(data_folder)
    except OSError as e:
        if(e.errno != 17):     # errno 17 == EEXIST
            raise e
else:
    raise Exception('Data path (' + data_folder + ') exists as a file. '
                    'Please rename, remove or change the desired location of the data path.')

# Check if the file exists already
if(not os.path.isfile(data_folder + '/go-basic.obo')):
    go_obo = wget.download(go_obo_url, data_folder + '/go-basic.obo')
else:
    go_obo = data_folder + '/go-basic.obo'
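As an aside, on Python 3.2 and newer the mkdir -p emulation above can be replaced by a single call to os.makedirs with exist_ok=True (it still raises if the path exists as a regular file). A minimal sketch, using a temporary directory rather than the tutorial's ./data path:

```python
import os
import tempfile

# Work in a throwaway directory so this sketch doesn't touch ./data.
base = tempfile.mkdtemp()
data_folder = os.path.join(base, 'data')

# exist_ok=True makes this a no-op when the folder already exists,
# replacing the manual try/except OSError dance.
os.makedirs(data_folder, exist_ok=True)
os.makedirs(data_folder, exist_ok=True)  # second call does not raise

print(os.path.isdir(data_folder))  # True
```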
The path to the GO OBO file is now stored in the variable go_obo.
print(go_obo)
/Users/alex/go-handbook-live/data/go-basic.obo
Now we can create a dictionary of the GO terms, using the obo_parser from GOATools.
go = obo_parser.GODag(go_obo)
/Users/alex/go-handbook-live/data/go-basic.obo: fmt(1.2) rel(2019-01-27) 47,381 GO Terms
go_id = 'GO:0048527'
go_term = go[go_id]
print(go_term)
GO:0048527 level-05 depth-06 lateral root development [biological_process]
print('GO term name: {}'.format(go_term.name))
print('GO term namespace: {}'.format(go_term.namespace))
GO term name: lateral root development
GO term namespace: biological_process
What are the immediate parent(s) of the term GO:0048527?
for term in go_term.parents:
    print(term)
GO:0048528 level-04 depth-05 post-embryonic root development [biological_process]
What are the immediate children of the term GO:0048527?
for term in go_term.children:
    print(term)
Recursively find all the parent and child terms of the term GO:0048527. Hint: use your solutions to the previous two questions, with a recursive loop.
First, we can define functions to find the parents and children of a GO term, as well as one that finds both: the transitive closure.
def transitive_closure(go_term, go):
    go_term_set = set()
    find_parents(go_term, go, go_term_set)
    find_children(go_term, go, go_term_set)
    return go_term_set

def find_parents(term1, go, go_term_set=None, ret=False):
    # Avoid a mutable default argument: create a fresh set per call.
    if go_term_set is None:
        go_term_set = set()
    for term2 in term1.parents:
        go_term_set.update({term2})
        # Recurse on term to find all parents
        find_parents(term2, go, go_term_set)
    if(ret):
        return go_term_set

def find_children(term1, go, go_term_set=None, ret=False):
    # Avoid a mutable default argument: create a fresh set per call.
    if go_term_set is None:
        go_term_set = set()
    for term2 in term1.children:
        go_term_set.update({term2})
        # Recurse on term to find all children
        find_children(term2, go, go_term_set)
    if(ret):
        return go_term_set
Now we can create the transitive closure as a set by calling the following.
go_term_set = transitive_closure(go_term, go)
We can now print this out using the "pretty print" feature that is inherited from GOATools.
for term in go_term_set:
    print(term)
GO:0099402 level-03 depth-03 plant organ development [biological_process]
GO:0048364 level-04 depth-04 root development [biological_process]
GO:0048856 level-02 depth-02 anatomical structure development [biological_process]
GO:0090696 level-03 depth-04 post-embryonic plant organ development [biological_process]
GO:0032501 level-01 depth-01 multicellular organismal process [biological_process]
GO:0032502 level-01 depth-01 developmental process [biological_process]
GO:0048528 level-04 depth-05 post-embryonic root development [biological_process]
GO:0009791 level-02 depth-02 post-embryonic development [biological_process]
GO:0008150 level-00 depth-00 biological_process [biological_process]
An alternative method is by using an inbuilt function of GOATools.
rec = go[go_id]
parents = rec.get_all_parents()
children = rec.get_all_children()
for term in parents.union(children):
    print(go[term])
GO:0032502 level-01 depth-01 developmental process [biological_process]
GO:0048856 level-02 depth-02 anatomical structure development [biological_process]
GO:0090696 level-03 depth-04 post-embryonic plant organ development [biological_process]
GO:0009791 level-02 depth-02 post-embryonic development [biological_process]
GO:0008150 level-00 depth-00 biological_process [biological_process]
GO:0099402 level-03 depth-03 plant organ development [biological_process]
GO:0048528 level-04 depth-05 post-embryonic root development [biological_process]
GO:0032501 level-01 depth-01 multicellular organismal process [biological_process]
GO:0048364 level-04 depth-04 root development [biological_process]
How many GO terms have the word “growth” in their name?
To do this, we can loop over every GO term and check its name.
growth_count = 0
for go_term in go.values():
    if 'growth' in go_term.name:
        growth_count += 1
print('Number of GO terms with "growth" in their name: {}'.format(growth_count))
Number of GO terms with "growth" in their name: 626
What is the deepest common ancestor term of GO:0048527 and GO:0097178?
First, we can write a function that finds the common GO terms.
def common_parent_go_ids(terms, go):
    '''
    This function finds the common ancestors in the GO
    tree of the list of terms in the input.
    '''
    # Find candidates from first
    rec = go[terms[0]]
    candidates = rec.get_all_parents()
    candidates.update({terms[0]})

    # Find intersection with second to nth term
    for term in terms[1:]:
        rec = go[term]
        parents = rec.get_all_parents()
        parents.update({term})

        # Find the intersection with the candidates, and update.
        candidates.intersection_update(parents)
    return candidates
Then we can define the deepest common ancestor of two terms, as follows:
def deepest_common_ancestor(terms, go):
    '''
    This function gets the nearest common ancestor
    using the above function.
    Only returns single most specific - assumes unique exists.
    '''
    # Take the element at maximum depth.
    return max(common_parent_go_ids(terms, go), key=lambda t: go[t].depth)
Then we can find the deepest common ancestor of the two terms in question.
go_id1 = 'GO:0097178'
go_id_id1_dca = deepest_common_ancestor([go_id, go_id1], go)
print('The nearest common ancestor of\n\t{} ({})\nand\n\t{} ({})\nis\n\t{} ({})'
      .format(go_id, go[go_id].name,
              go_id1, go[go_id1].name,
              go_id_id1_dca, go[go_id_id1_dca].name))
The nearest common ancestor of
	GO:0048527 (lateral root development)
and
	GO:0097178 (ruffle assembly)
is
	GO:0008150 (biological_process)
We first have to create the query to return the record for the GO term GO:0097192.
go_id2 = 'GO:0097192'
rec = go[go_id2]
We then need to decide where to save the image file.
lineage_png = go_id2 + '-lineage.png'
Then we can draw the lineage, like so (you may need to install the pygraphviz library via pip):
go.draw_lineage([rec], lineage_img=lineage_png)
lineage info for terms ['GO:0097192'] written to GO:0097192-lineage.png
Note, the GOATools draw_lineage function will output a PNG to ./GO_lineage.png, or to whatever is defined by the argument lineage_img. This can be viewed with your system viewer, or included here by running something similar to the following code. It may be necessary to double click on the image and scroll, in order to see the detail.
from IPython.display import Image
Image(lineage_png)
Using this figure, what is the most specific term that is in the parent terms of both GO:0097191 and GO:0038034? This is also referred to as the lowest common ancestor.
This can be read off the graph: the term GO:0007165, signal transduction.
Using the get_term() function, listed below, answer the following questions. Note: the get_oboxml() function listed in the chapter, in Source Code 2.1, will no longer work, due to an API overhaul of the EMBL-EBI's QuickGO browser.
For the interested reader, it is also possible to use the bioservices library to retrieve information from QuickGO (as well as from many other web services).
from future.standard_library import install_aliases
install_aliases()
from urllib.request import urlopen
import json
def get_term(go_id):
    """
    This function retrieves the definition of a given Gene Ontology term,
    using EMBL-EBI's QuickGO browser.
    Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
    """
    quickgo_url = "https://www.ebi.ac.uk/QuickGO/services/ontology/go/terms/" + go_id
    ret = urlopen(quickgo_url)

    # Check the response
    if(ret.getcode() == 200):
        term = json.loads(ret.read())
        return term['results'][0]
    else:
        raise ValueError("Couldn't receive information from QuickGO. Check GO ID and try again.")
Find the name and description of the GO term GO:0048527. Hint: print out the dictionary returned by the function and study its structure.
First, let's get the term information.
term = get_term(go_id)
It might be useful to study the structure of this dictionary, in order to answer this question.
term
{'id': 'GO:0048527', 'isObsolete': False, 'name': 'lateral root development', 'definition': {'text': 'The process whose specific outcome is the progression of the lateral root over time, from its formation to the mature structure. A lateral root is one formed from pericycle cells located on the xylem radius of the root, as opposed to the initiation of the main root from the embryo proper.'}, 'children': [{'id': 'GO:1901332', 'relation': 'negatively_regulates'}, {'id': 'GO:1901333', 'relation': 'positively_regulates'}, {'id': 'GO:0010102', 'relation': 'part_of'}, {'id': 'GO:2000023', 'relation': 'regulates'}, {'id': 'GO:1902089', 'relation': 'part_of'}], 'aspect': 'biological_process', 'usage': 'Unrestricted'}
Now we can extract the relevant data.
print('Description of {} is:\n\t{}'.format(go_id, term['definition']['text']))
Description of GO:0048527 is:
	The process whose specific outcome is the progression of the lateral root over time, from its formation to the mature structure. A lateral root is one formed from pericycle cells located on the xylem radius of the root, as opposed to the initiation of the main root from the embryo proper.
Look at the difference in the QuickGO output for the GO terms GO:0048527 and GO:0097178, then generate a table of the synonymous relationships of the term GO:0097178.
Firstly, generate the term dictionary for this new term.
term1 = get_term(go_id1)
We can then extract the synonyms as follows.
term1
{'id': 'GO:0097178', 'isObsolete': False, 'name': 'ruffle assembly', 'definition': {'text': 'The aggregation, arrangement and bonding together of a set of components to form a ruffle, a projection at the leading edge of a crawling cell; the protrusions are supported by a microfilament meshwork. The formation of ruffles (also called membrane ruffling) is thought to be controlled by a group of enzymes known as Rho GTPases, specifically RhoA, Rac1 and cdc42.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '12556481'}, {'dbCode': 'URL', 'dbId': 'http:en.wikipedia.org/wiki/Membrane_ruffling'}]}, 'synonyms': [{'name': 'membrane ruffle formation', 'type': 'exact'}, {'name': 'membrane ruffling', 'type': 'exact'}], 'children': [{'id': 'GO:1900029', 'relation': 'positively_regulates'}, {'id': 'GO:1900028', 'relation': 'negatively_regulates'}, {'id': 'GO:1900027', 'relation': 'regulates'}], 'aspect': 'biological_process', 'usage': 'Unrestricted'}
term1['synonyms']
[{'name': 'membrane ruffle formation', 'type': 'exact'}, {'name': 'membrane ruffling', 'type': 'exact'}]
synonyms = term1['synonyms']
print(synonyms)
[{'name': 'membrane ruffle formation', 'type': 'exact'}, {'name': 'membrane ruffling', 'type': 'exact'}]
As an example, this can be nicely formatted using the AsciiTable module in the terminaltables package (which you may need to install).
from terminaltables import AsciiTable
# Initialise table data and set header
table_data = [['Type', 'Synonym']]
for synonym in synonyms:
    table_data.append([synonym['type'], synonym['name']])
print(AsciiTable(table_data).table)
+-------+---------------------------+
| Type  | Synonym                   |
+-------+---------------------------+
| exact | membrane ruffle formation |
| exact | membrane ruffling         |
+-------+---------------------------+
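If you'd rather not install an extra dependency, a similar fixed-width table can be produced with plain str.format. A minimal sketch, using a hard-coded copy of the synonym data above so that it runs standalone:

```python
# Hard-coded copy of the synonyms above, so this sketch runs standalone.
synonyms = [{'name': 'membrane ruffle formation', 'type': 'exact'},
            {'name': 'membrane ruffling', 'type': 'exact'}]

# Build the rows (header first), then compute each column's width.
rows = [('Type', 'Synonym')] + [(s['type'], s['name']) for s in synonyms]
widths = [max(len(r[i]) for r in rows) for i in range(2)]

# Print each row padded to the column widths.
for row in rows:
    print('| {:<{}} | {:<{}} |'.format(row[0], widths[0], row[1], widths[1]))
```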
In this section we will look at how to manipulate the Gene Association File (GAF) standard, using a parser from the BioPython package.
import Bio.UniProt.GOA as GOA
First we need to download a GAF file from the EBI FTP website, which hosts the current and all previous UniProt-GOA annotations. The links to these can be found on the EBI GOA Downloads page.
As an example, we are going to download the reduced GAF file containing gene association data for Arabidopsis thaliana.
import os
from ftplib import FTP
arab_uri = '/pub/databases/GO/goa/ARABIDOPSIS/goa_arabidopsis.gaf.gz'
arab_fn = arab_uri.split('/')[-1]
# Check if the file exists already
arab_gaf = os.path.join(data_folder, arab_fn)
if(not os.path.isfile(arab_gaf)):
    # Login to FTP server
    ebi_ftp = FTP('ftp.ebi.ac.uk')
    ebi_ftp.login()  # Logs in anonymously

    # Download
    with open(arab_gaf, 'wb') as arab_fp:
        ebi_ftp.retrbinary('RETR {}'.format(arab_uri), arab_fp.write)

    # Logout from FTP server
    ebi_ftp.quit()
Now we can load all the annotations into a dictionary, using the iterator from the BioPython package (Bio.UniProt.GOA.gafiterator).
import gzip
# The file is gzipped, so we need to open it in this way
with gzip.open(arab_gaf, 'rt') as arab_gaf_fp:
    arab_funcs = {}  # Initialise the dictionary of functions

    # Iterate on each function using Bio.UniProt.GOA library.
    for entry in GOA.gafiterator(arab_gaf_fp):
        uniprot_id = entry.pop('DB_Object_ID')
        arab_funcs[uniprot_id] = entry
Now we have a structure of the annotations which can be manipulated. Each of the entries has been loaded in the following form.
print(arab_funcs[list(arab_funcs.keys())[0]])
{'DB': 'UniProtKB', 'DB_Object_Symbol': 'PIAL1', 'Qualifier': [''], 'GO_ID': 'GO:0051176', 'DB:Reference': ['PMID:25415977'], 'Evidence': 'IGI', 'With': ['AGI_LocusCode:AT5G41580'], 'Aspect': 'P', 'DB_Object_Name': 'E4 SUMO-protein ligase PIAL1', 'Synonym': ['PIAL1', 'EMB3001', 'At1g08910', 'F7G19.21'], 'DB_Object_Type': 'protein', 'Taxon_ID': ['taxon:3702'], 'Date': '20141218', 'Assigned_By': 'TAIR', 'Annotation_Extension': '', 'Gene_Product_Form_ID': ''}
Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?
This is possible by looping through each of the values and checking whether NOT is listed as one of the qualifiers. Even though it makes no difference here whether we check if 'NOT' in func['Qualifier'] or if 'NOT' in func['Qualifier'][0], the first is preferred, because there can be multiple qualifiers for a given annotation.
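A tiny standalone example (with made-up qualifier values) shows why: when an annotation carries more than one qualifier, checking only the first element can miss a NOT in a later position.

```python
# A hypothetical annotation entry carrying two qualifiers.
func = {'Qualifier': ['colocalizes_with', 'NOT']}

# Membership test over the whole list sees every qualifier...
print('NOT' in func['Qualifier'])     # True
# ...whereas a substring test on the first element alone does not.
print('NOT' in func['Qualifier'][0])  # False
```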
not_count = 0
total_count = len(arab_funcs)
for func in arab_funcs.values():
    if 'NOT' in func['Qualifier']:
        not_count += 1
print('Total count of NOT qualifiers: {} ({}%)'.format(not_count, round(((float(not_count)/float(total_count))*100),2)))
print('Total number of annotations: {}'.format(total_count))
Total count of NOT qualifiers: 802 (3.17%)
Total number of annotations: 25289
How many genes (of Arabidopsis thaliana) have the annotation GO:0048527?
This is done by checking whether each annotation's GO_ID entry is equal to the required term's ID. Additionally, there is a check on the taxonomic ID here. This isn't strictly necessary, but would be required if the annotation database contained multiple species.
arab_tax_id = 3702 # This isn't necessary here, but would be with the full data set.
annot_count = 0
counted_gene = []
for uniprot_id in arab_funcs:
    if('taxon:' + str(arab_tax_id) in arab_funcs[uniprot_id]['Taxon_ID']):
        if((arab_funcs[uniprot_id]['GO_ID'] == go_id)):
            counted_gene.append(uniprot_id)
            annot_count += 1

del counted_gene
We can now find the number of genes:
print(annot_count)
17
Generate a list of annotated proteins which have the word “growth” in their name.
This can be generated in the following way, checking each annotation's DB_Object_Name field for the keyword.
keyword = 'growth'
growth_dict = {x: arab_funcs[x]
               for x in arab_funcs
               if keyword in arab_funcs[x]['DB_Object_Name']}
print('UniProt IDs of annotations with "growth" in their name:')
for annot in growth_dict:
    print("\t - " + annot)
print("Total: {}".format(len(growth_dict)))
UniProt IDs of annotations with "growth" in their name:
	 - A0A1P8B471
	 - B3H5J1
	 - F4KHI3
	 - Q3C1C7
	 - Q3E880
	 - Q6DSU1
	 - Q6ID76
	 - Q6NNL3
	 - Q93VK8
	 - Q9FGF6
	 - Q9FHF5
	 - Q9LI64
	 - Q9SI57
Total: 13
There are 21 evidence codes used in the Gene Ontology project. As discussed in Chapter 3, many of these are inferred, either by curators or automatically. Find the counts of each evidence code in the Arabidopsis thaliana annotation file.
This can be done by looping through each of the annotations, counting each of their evidence codes and placing them into a dictionary.
evidence_count = {}
for annotation in arab_funcs:
    evidence = arab_funcs[annotation]['Evidence']
    if(evidence not in evidence_count):
        evidence_count[evidence] = 1
    else:
        evidence_count[evidence] += 1
The counts are then:
table_data = [['Evidence Code', 'Count']]
for code in sorted(evidence_count.keys()):
    table_data.append([code, str(evidence_count[code])])
print(AsciiTable(table_data).table)
+---------------+-------+
| Evidence Code | Count |
+---------------+-------+
| IBA           | 4730  |
| IC            | 1     |
| IDA           | 1911  |
| IEA           | 11020 |
| IEP           | 551   |
| IGI           | 326   |
| IMP           | 2068  |
| IPI           | 227   |
| ISM           | 18    |
| ISS           | 332   |
| NAS           | 11    |
| ND            | 3295  |
| RCA           | 463   |
| TAS           | 336   |
+---------------+-------+
(All others are zero.)
To help visualise the counts of each evidence code found in the previous question, produce a pie chart showing the proportion of annotations with each evidence code for the Arabidopsis thaliana annotations dataset.
In order to draw the pie chart, we need to convert the counts into percentages.
evidence_percent = {}
for code in evidence_count:
    evidence_percent[code] = ((float(evidence_count[code]) /
                               float(total_count))
                              * 100)
Now we can have a look at the counts and percentages for each evidence code.
table_data = [['Evidence Code', 'Count', 'Percentage (%)']]
for code in sorted(evidence_count.keys()):
    table_data.append([code, str(evidence_count[code]), str(round(evidence_percent[code], 2))])
print(AsciiTable(table_data).table)
+---------------+-------+----------------+
| Evidence Code | Count | Percentage (%) |
+---------------+-------+----------------+
| IBA           | 4730  | 18.7           |
| IC            | 1     | 0.0            |
| IDA           | 1911  | 7.56           |
| IEA           | 11020 | 43.58          |
| IEP           | 551   | 2.18           |
| IGI           | 326   | 1.29           |
| IMP           | 2068  | 8.18           |
| IPI           | 227   | 0.9            |
| ISM           | 18    | 0.07           |
| ISS           | 332   | 1.31           |
| NAS           | 11    | 0.04           |
| ND            | 3295  | 13.03          |
| RCA           | 463   | 1.83           |
| TAS           | 336   | 1.33           |
+---------------+-------+----------------+
Now we are going to plot this information as a pie chart, in order to better visualise the lack of experimental evidence.
# Declare matplotlib as inline and import pyplot
%matplotlib inline
from matplotlib import pyplot
# Setup the figure
fig = pyplot.figure(1, figsize=(8,8), dpi=96)
ax=fig.add_axes([0.1,0.1,0.8,0.8])
# Extract the labels / percentages
labels = evidence_percent.keys()
fracs = evidence_percent.values()
# Make IEA obvious by "exploding" it
explode = [int('IEA' in x)*0.1 for x in labels]
# Plot the pie chart
patches, texts = ax.pie(list(fracs), explode=explode, startangle=90, labeldistance=1.1)
# Add the legend
ax.legend(patches, labels, bbox_to_anchor=(1.2, 0.75), fontsize=12)
pyplot.show()
In this section, the GOEnrichmentStudy() function from the GOATools library will be used to perform GO enrichment analysis.
from goatools.go_enrichment import GOEnrichmentStudy
Perform an enrichment analysis using the list of genes with the "growth" keyword from exercise 3.1.c, and the GO structure from exercise 2.1.
The population is the functions observed in the Arabidopsis thaliana GOA file.
pop = arab_funcs.keys()
Then, we need to create a dictionary of genes with their UniProt ID as a key and their set of GO annotations as the values.
assoc = {}
for x in arab_funcs:
    if x not in assoc:
        assoc[x] = set()
    assoc[x].add(str(arab_funcs[x]['GO_ID']))
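The same association dictionary can be built a little more idiomatically with collections.defaultdict, which removes the explicit membership test. A sketch on made-up toy data (the UniProt IDs here are hypothetical):

```python
from collections import defaultdict

# Toy annotations in the same shape as arab_funcs (made-up IDs).
funcs = {
    'P00001': {'GO_ID': 'GO:0048527'},
    'P00002': {'GO_ID': 'GO:0008150'},
}

# Missing keys are created automatically as empty sets.
assoc = defaultdict(set)
for uniprot_id, entry in funcs.items():
    assoc[uniprot_id].add(str(entry['GO_ID']))

print(dict(assoc))
```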
Now, the study set here is those genes with the "growth" keyword, found previously.
study = growth_dict.keys()
Which GO term is most significantly enriched or depleted? Does this make sense?
Possible methods for the GOEnrichmentStudy are:
methods = ["bonferroni", "sidak", "holm", "fdr"]
It is possible to run on all methods, or just a subset.
g = GOEnrichmentStudy(pop, assoc, go,
                      propagate_counts=True,
                      alpha=0.05,
                      methods=methods)
g_res = g.run_study(study)
fisher module not installed. Falling back on scipy.stats.fisher_exact
Propagating term counts to parents ..
100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
3 GO terms found significant (< 0.05=alpha) after multitest correction: local bonferroni
3 GO terms found significant (< 0.05=alpha) after multitest correction: local sidak
3 GO terms found significant (< 0.05=alpha) after multitest correction: local holm
Generate p-value distribution for FDR based on resampling (this might take a while)
Sample 0 / 500: p-value 0.0010278710533060788
fisher module not installed.  Falling back on scipy.stats.fisher_exact
Sample 10 / 500: p-value 0.0015414407798976498
Sample 20 / 500: p-value 0.0005140574953738786
[... progress output for samples 30-480 omitted ...]
Sample 490 / 500: p-value 0.0010278710533060788
5 GO terms found significant (< 0.05=alpha) after multitest correction: local fdr
g.print_results(g_res, min_ratio=None, pval=0.01)
GO NS enrichment name ratio_in_study ratio_in_pop p_uncorrected depth study_count p_bonferroni p_sidak p_holm p_fdr study_items
GO:0030154 BP e cell differentiation 6/13 57/25289 1.69e-13 3 6 9.09e-10 8.86e-10 9.09e-10 0 Q3E880, Q6DSU1, Q6ID76, Q6NNL3, Q9LI64, Q9SI57
GO:0048869 BP e cellular developmental process 6/13 113/25289 1.16e-11 2 6 6.24e-08 6.09e-08 6.24e-08 0 Q3E880, Q6DSU1, Q6ID76, Q6NNL3, Q9LI64, Q9SI57
GO:0032502 BP e developmental process 7/13 712/25289 2.02e-08 1 7 0.000108 0.000105 0.000108 0 B3H5J1, Q3E880, Q6DSU1, Q6ID76, Q6NNL3, Q9LI64, Q9SI57
GO:2000012 BP e regulation of auxin polar transport 2/13 20/25289 4.61e-05 5 2 0.247 0.241 0.247 0.014 Q93VK8, Q9FGF6
GO:0051049 BP e regulation of transport 2/13 42/25289 0.000208 4 2 1 1 1 0.036 Q93VK8, Q9FGF6
GO:0010817 BP e regulation of hormone levels 2/13 69/25289 0.000561 3 2 1 1 1 0.422 Q93VK8, Q9FGF6
GO:0032879 BP e regulation of localization 2/13 107/25289 0.00134 3 2 1 1 1 0.638 Q93VK8, Q9FGF6
GO:0065008 BP e regulation of biological quality 2/13 265/25289 0.00791 2 2 1 1 1 0.988 Q93VK8, Q9FGF6
GO:0048527 BP e lateral root development 1/13 17/25289 0.00871 6 1 1 1 1 0.99 B3H5J1
We can see that the most significant term is GO:0030154 (cell differentiation). It makes sense that this would be associated with proteins having the keyword "growth" in their name.
How many terms are enriched, when using the Bonferroni corrected method and a p-value $\le$ 0.01?
Perform the GO Enrichment study using the Bonferroni corrected method.
g_bonferroni = GOEnrichmentStudy(pop, assoc, go,
                                 propagate_counts=True,
                                 alpha=0.01,
                                 methods=['bonferroni'])
g_bonferroni_res = g_bonferroni.run_study(study)
fisher module not installed. Falling back on scipy.stats.fisher_exact
Propagating term counts to parents ..
100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
3 GO terms found significant (< 0.01=alpha) after multitest correction: local bonferroni
s_bonferroni = []
for x in g_bonferroni_res:
    if x.p_bonferroni <= 0.01:
        s_bonferroni.append((x.goterm.id, x.p_bonferroni))
print(s_bonferroni)
[('GO:0030154', 9.088903406049323e-10), ('GO:0048869', 6.244383841758757e-08), ('GO:0032502', 0.00010815519774707948)]
How many terms are enriched with false discovery rate (a.k.a. q-value) $\le$ 0.01?
g_fdr = GOEnrichmentStudy(pop, assoc, go,
                          propagate_counts=True,
                          alpha=0.05,
                          methods=['fdr'])
g_fdr_res = g_fdr.run_study(study)
fisher module not installed. Falling back on scipy.stats.fisher_exact
Propagating term counts to parents ..
100% 25,289 of 25,289 population items found in association
100% 13 of 13 study items found in association
100% 13 of 13 study items found in population(25289)
Calculating 5,364 uncorrected p-values using fisher_scipy_stats
5,364 GO terms are associated with 25,289 of 25,289 population items
40 GO terms are associated with 13 of 13 study items
fisher module not installed.  Falling back on scipy.stats.fisher_exact
Generate p-value distribution for FDR based on resampling (this might take a while)
Sample 0 / 500: p-value 0.011253078458717746
Sample 10 / 500: p-value 0.0005140574953738786
[... progress output for samples 20-480 omitted ...]
Sample 490 / 500: p-value 0.0005140574953738786
5 GO terms found significant (< 0.05=alpha) after multitest correction: local fdr
s_fdr = []
for x in g_fdr_res:
    if x.p_fdr <= 0.01:
        s_fdr.append((x.goterm.id, x.p_fdr))
print(s_fdr)
[('GO:0030154', 0.0), ('GO:0048869', 0.0), ('GO:0032502', 0.0), ('GO:2000012', 0.006)]
In this section we look at how to compute semantic similarity between GO terms.
from collections import Counter
class TermCounts():
    '''
    TermCounts stores the observation counts of each GO term in a set
    of annotations, together with the total counts per aspect
    (namespace). Counts are propagated to all parent terms.
    '''
    def __init__(self, go, annots):
        '''
        Initialise the counters and fill them from the annotations.
        '''
        # Backup
        self._go = go

        # Initialise the counters
        self._counts = Counter()
        self._aspect_counts = Counter()

        # Fill the counters...
        self._count_terms(go, annots)

    def _count_terms(self, go, annots):
        '''
        Fills in the counts and overall aspect counts.
        '''
        for x in annots:
            # Extract term information
            go_id = annots[x]['GO_ID']
            namespace = go[go_id].namespace
            self._counts[go_id] += 1

            # Propagate the count to all parent terms
            rec = go[go_id]
            parents = rec.get_all_parents()
            for p in parents:
                self._counts[p] += 1

            self._aspect_counts[namespace] += 1

    def get_count(self, go_id):
        '''
        Returns the count of that GO term observed in the annotations.
        '''
        return self._counts[go_id]

    def get_total_count(self, aspect):
        '''
        Gets the total count that's been precomputed.
        '''
        return self._aspect_counts[aspect]

    def get_term_freq(self, go_id):
        '''
        Returns the frequency at which a particular GO term has
        been observed in the annotations.
        '''
        try:
            namespace = self._go[go_id].namespace
            freq = float(self.get_count(go_id)) / float(self.get_total_count(namespace))
        except ZeroDivisionError:
            freq = 0
        return freq
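The frequency logic in get_term_freq can be illustrated in isolation with a Counter over some made-up term counts for a single namespace:

```python
from collections import Counter

# Made-up observation counts for two terms in one namespace.
counts = Counter({'GO:0008150': 10, 'GO:0048527': 2})

# The total count for the aspect is the sum of all observations.
total = sum(counts.values())

# Frequency of a term = its count / total count for its aspect.
freq = counts['GO:0048527'] / total
print(freq)  # 2 / 12
```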
GO:0048364 (root development) and GO:0044707 (single-multicellular organism process) are two GO terms taken from Figure 1. Calculate the semantic similarity between them based on the inverse of the semantic distance (number of branches separating them).
First we need to write a function that calculates the minimum number of branches connecting two GO terms.
def min_branch_length(go_id1, go_id2, go):
    '''
    Finds the minimum branch length between two terms in the GO DAG.
    '''
    # First get the deepest common ancestor
    dca = deepest_common_ancestor([go_id1, go_id2], go)

    # Then get the distance from the DCA to each term
    dca_depth = go[dca].depth
    d1 = go[go_id1].depth - dca_depth
    d2 = go[go_id2].depth - dca_depth

    # Return the total distance - i.e., to the deepest common ancestor and back.
    return d1 + d2
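As a quick sanity check on the arithmetic, with hypothetical depths: two terms at depths 4 and 3 whose deepest common ancestor sits at depth 2 are separated by (4 - 2) + (3 - 2) = 3 branches.

```python
# Hypothetical depths, for illustration only.
depth1, depth2, dca_depth = 4, 3, 2

# Distance from each term up to the deepest common ancestor, summed.
d1 = depth1 - dca_depth
d2 = depth2 - dca_depth
print(d1 + d2)  # 3
```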
go_id3 = 'GO:0048364'
go_id4 = 'GO:0044707'
Now we can calculate the semantic distance and semantic similarity, like so:
def semantic_distance(go_id1, go_id2, go):
    '''
    Finds the semantic distance (minimum number of connecting branches)
    between two GO terms.
    '''
    return min_branch_length(go_id1, go_id2, go)

def semantic_similarity(go_id1, go_id2, go):
    '''
    Finds the semantic similarity (inverse of the semantic distance)
    between two GO terms.
    '''
    return 1.0 / float(semantic_distance(go_id1, go_id2, go))
sim = semantic_similarity(go_id3, go_id4, go)
print('The semantic similarity between terms {} and {} is {}.'.format(go_id3, go_id4, sim))
The semantic similarity between terms GO:0048364 and GO:0044707 is 0.2.
Calculate the information content (IC) of the GO term GO:0048364 (root development), based on the frequency of observation in Arabidopsis thaliana.
First we need to define what the information content is.
import math
def ic(go_id, termcounts):
    '''
    Calculates the information content of a GO term.
    '''
    # Get the observed frequency of the GO term
    freq = termcounts.get_term_freq(go_id)

    # Calculate the information content (i.e., -log("freq of GO term"))
    return -1.0 * math.log(freq)
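As a plausibility check on the formula: a term observed once in every 1000 annotations of its aspect would have IC -ln(1/1000) ≈ 6.9, the same order of magnitude as the value computed below. The frequency here is illustrative, not taken from the data:

```python
import math

# Illustrative frequency: 1 observation in 1000.
freq = 1 / 1000

# IC = -ln(freq); the natural log gives about 6.908 here.
ic_value = -math.log(freq)
print(round(ic_value, 3))
```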
Then we can calculate the information content of a single term. Here we use go_id (GO:0048527, from earlier); passing 'GO:0048364' instead gives the value for root development.
# First get the counts of each GO term.
termcounts = TermCounts(go, arab_funcs)
# Calculate the information content
infocontent = ic(go_id, termcounts)
print(infocontent)
6.703332455042499
Calculate the Resnik similarity measure between the same two terms as in 5.1.a.
Recall that Resnik's similarity measure is defined as the information content of the most informative common ancestor. That is, the most specific common parent-term in the GO.
def resnik_sim(go_id1, go_id2, go, termcounts):
    '''
    Computes Resnik's similarity measure.
    '''
    msca = deepest_common_ancestor([go_id1, go_id2], go)
    return ic(msca, termcounts)
Then we can calculate this as follows:
sim_r = resnik_sim(go_id3, go_id4, go, termcounts)
print('Resnik similarity score: {}'.format(sim_r))
Resnik similarity score: -0.0
The score is zero because the deepest common ancestor of these two terms is the root biological_process: once counts are propagated to parents, every biological_process annotation increments the root's count, so its observed frequency is 1 and its information content is -log(1) = 0.
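The -0.0 itself can be reproduced in isolation: with a frequency of exactly 1, the IC computation multiplies log(1) = 0.0 by -1.0, producing IEEE negative zero (which compares equal to 0.0 but prints with the minus sign).

```python
import math

# A frequency of 1 corresponds to a term observed in every annotation,
# such as the root term after counts are propagated to parents.
freq = 1.0

# -1.0 * 0.0 is IEEE negative zero, hence the printed "-0.0".
ic_root = -1.0 * math.log(freq)
print(ic_root)  # -0.0
```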