# Enrichment analysis and visualisations¶

This is loosely adapted from one of the InterMineR vignettes https://bioconductor.org/packages/release/bioc/vignettes/InterMineR/inst/doc/Enrichment_Analysis_and_Visualization.html

## Background¶

The InterMine gene set enrichment tool looks for annotations to a set of genes that occur more than would be expected by chance, given a background population of annotations. The hypergeometric distribution is used to calculate a P-value for each annotation and a choice of correction methods for multiple testing (Bonferonni, Holm-Bonferonni and Benjamini Hochberg) are available (Smith et al. 2012; Kalderimis et al. 2014).

InterMine provides Gene Ontology enrichment statistics as well as enrichment statistics for other annotation types including protein domains, pathways, human diseases, mammalian phenotypes and publications. The default background population is all genes in the genome with the specified annotation type. However, the background population can be changed by specifying another list. More information can be found in the online documentation.

## Goal for this exercise¶

1. Use InterMine's enrichment widgets on the HumanMine public list PL_GenomicsEngland_GenePanel%:Genetic_Epilepsy_Syndromes to see if the list is enriched for any GO Terms.
2. Visualise the results using the GeneAnswers class.

### Setup¶

Let's start by initialising InterMineR and choosing which InterMine we'll be running queries against:

In [1]:
#Begin by activating the InterMineR library:
library(InterMineR)


We want to query human data - so let's look and see what InterMines are available:

In [2]:
listMines()

BMAP
'https://bmap.jgi.doe.gov/bmapmine/'
BeanMine
'https://mines.legumeinfo.org/beanmine'
BovineMine
'http://genomes.missouri.edu/bovinemine'
CHOmine
'https://chomine.boku.ac.at/chomine'
ChickpeaMine
'https://mines.legumeinfo.org/chickpeamine'
CowpeaMine
'https://mines.legumeinfo.org/cowpeamine'
FlyMine
'http://www.flymine.org/flymine'
GrapeMine
'http://urgi.versailles.inra.fr/GrapeMine'
HumanMine
'http://www.humanmine.org/humanmine'
HymenopteraMine
'http://hymenopteragenome.org/hymenopteramine'
IndigoMine
'http://www.cbrc.kaust.edu.sa/indigo'
LegumeMine
'https://mines.legumeinfo.org/legumemine'
MaizeMine
'http://maizemine.rnet.missouri.edu:8080/maizemine'
MedicMine
'http://medicmine.jcvi.org/medicmine'
MitoMiner
'http://mitominer.mrc-mbu.cam.ac.uk/release-4.0'
ModMine
'http://intermine.modencode.org/release-33'
MouseMine
'http://www.mousemine.org/mousemine'
OakMine
'https://urgi.versailles.inra.fr/OakMine_PM1N'
PeanutMine
'https://mines.legumeinfo.org/peanutmine'
PhytoMine
'https://phytozome.jgi.doe.gov/phytomine/'
PlanMine
'http://planmine.mpi-cbg.de/planmine'
RatMine
'http://ratmine.mcw.edu/ratmine'
RepetDB
'http://urgi.versailles.inra.fr/repetdb'
SoyMine
'https://mines.legumeinfo.org/soymine'
TargetMine
'https://targetmine.mizuguchilab.org/targetmine'
ThaleMine
'https://apps.araport.org/thalemine'
WheatMine
'https://urgi.versailles.inra.fr/WheatMine'
WormMine
'http://intermine.wormbase.org/tools/wormmine/'
XenMine
'http://www.xenmine.org/xenmine'
YeastMine
'https://yeastmine.yeastgenome.org/yeastmine'
ZebrafishMine
'http://zebrafishmine.org'

Okay, let's select HumanMine from this list:

In [3]:
humanMine <- listMines()["HumanMine"] #select humanmine
humanMine #print out the value to see what's inside

HumanMine: 'http://www.humanmine.org/humanmine'

And now we need to tell InterMineR to query HumanMine specifically:

In [4]:
im <- initInterMine(mine=humanMine)


## Getting started with Enrichment¶

Performing enrichment analysis with InterMineR is preceded by two steps:

1. Get the enrichment widget name which indicates the annotation types that you want to investigate for enrichment (e.g. Gene Ontology Terms, protein domains, KEGG and Reactome Pathways, human diseases, mammalian phenotypes and publications).
2. Retrieve the list of bioentities of interest (Genes, Proteins, SNPs, etc.) for which the analysis will be performed.
In [5]:
# First, we'll retrieve the widgets available from HumanMine. This will tell us what types of enrichment are available.

humanWidgets <- getWidgets(im)

# Print out the widgets so we can see what there is

humanWidgets

A matrix: 10 × 12 of type chr
startClassDisplayenrichIdentifiernamedescriptionenrichstartClasstitletargetswidgetTypechartTypefilterslabels
primaryIdentifierGWASResults.study.publication.pubMedId snp_gwas_study_enrichment GWAS studies enriched for SNPs in this list. GWASResults.study.name SNP GWAS study Enrichment for SNPs SNP enrichmentNA NA NA
NA NA chromosome_distribution_for_snp Actual: number of items in this list found on each chromosome. Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.NA SNP Chromosome Distribution SNP chart ColumnChartorganism.name=[list] Chromosome & Count
primaryIdentifiergoAnnotation.ontologyTerm.parents.identifier go_enrichment_for_gene GO terms enriched for items in this list. goAnnotation.ontologyTerm.parents.name Gene Gene Ontology Enrichment Gene enrichmentNA biological_process,cellular_component,molecular_functionNA
primaryIdentifierpublications.pubMedId publication_enrichment Publications enriched for genes in this list. publications.title Gene Publication Enrichment Gene enrichmentNA NA NA
NA NA chromosome_distribution_for_geneActual: number of items in this list found on each chromosome. Expected: given the total number of items on the chromosome and the number of items in this list, the number of items expected to be found on each chromosome.NA Gene Chromosome Distribution Gene chart ColumnChartorganism.name=[list] Chromosome & Count
primaryIdentifierpathways.identifier pathway_enrichment Pathways enriched for genes in this list - data from KEGG and Reactome pathways.name Gene Pathway Enrichment Gene enrichmentNA All,KEGG pathways data set,Reactome data set NA
primaryIdentifierGWASResults.study.publication.pubMedId snp_publication_enrichment Publications enriched for SNPs in this list. GWASResults.study.publication.title SNP Publication Enrichment for SNPsSNP enrichmentNA NA NA
primaryIdentifierproteins.proteinDomainRegions.proteinDomain.primaryIdentifierprot_dom_enrichment_for_gene Protein Domains enriched for items in this list. proteins.proteinDomainRegions.proteinDomain.nameGene Protein Domain Enrichment Gene enrichmentNA NA NA
primaryIdentifierproteinDomainRegions.proteinDomain.primaryIdentifier prot_dom_enrichment_for_protein Protein Domains enriched for items in this list. proteinDomainRegions.proteinDomain.name ProteinProtein Domain Enrichment ProteinenrichmentNA NA NA
NA NA interactions Genes (from the list or not) that interact with genes in this list. Counts may include the same interaction more than once if observed in multiple experiments. NA NA Interactions Gene table NA NA NA

Take a look through the widgets returned, especially the columns targets and widgetType. Since we plan to enrich the gene list PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes, we're only interested in widgets that target Genes. Similarly, we don't need to look at any widgets that aren't enrichment type widgets.

In [6]:
# We'll put the widget list in a data frame, so it's easy to filter.
humanWidgetsDataFrame <- as.data.frame(humanWidgets)

# Now let's ask the data frame to give us a subset of the widgets.
subset(humanWidgetsDataFrame, widgetType == 'enrichment' & targets == 'Gene')

A data.frame: 4 × 12
startClassDisplayenrichIdentifiernamedescriptionenrichstartClasstitletargetswidgetTypechartTypefilterslabels
<fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct><fct>
3primaryIdentifiergoAnnotation.ontologyTerm.parents.identifier go_enrichment_for_gene GO terms enriched for items in this list. goAnnotation.ontologyTerm.parents.name GeneGene Ontology Enrichment GeneenrichmentNAbiological_process,cellular_component,molecular_functionNA
4primaryIdentifierpublications.pubMedId publication_enrichment Publications enriched for genes in this list. publications.title GenePublication Enrichment GeneenrichmentNANA NA
6primaryIdentifierpathways.identifier pathway_enrichment Pathways enriched for genes in this list - data from KEGG and Reactomepathways.name GenePathway Enrichment GeneenrichmentNAAll,KEGG pathways data set,Reactome data set NA
8primaryIdentifierproteins.proteinDomainRegions.proteinDomain.primaryIdentifierprot_dom_enrichment_for_geneProtein Domains enriched for items in this list. proteins.proteinDomainRegions.proteinDomain.nameGeneProtein Domain EnrichmentGeneenrichmentNANA NA

#### Syntax Note: = vs ==¶

##### Assignment operator: = or <-¶

In the code example above, we use the double equals sign ==. This is because a single equals sign is used for assignment, e.g. kittens = 5, which would assign the value 5 to the variable kittens.

In R we also use <- for assignment, so kittens = 5 and kittens <- 5 do (basically) the same thing.

##### Comparison operator: ==¶

The double equals, however, is used for comparison - when we say widgetType == 'enrichment' we're saying that the code should compare all widgetTypes, and only give us ones that are equal to enrichment.

### Perform enrichment analysis¶

Looking at the filtered list of widgets above, it looks like the widget we want to use to check for enriched GO Terms is named go_enrichment_for_gene. We'll use the doEnrichment method to get the enrichment results from HumanMine.

In [7]:
# We're using the widget go_enrichment_for_gene against the list name
# PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes
#
#Syntax:
#
# GO_enrichResult = doEnrichment(
#   im = YOUR_CHOSEN_INTERMINE_HERE
#   genelist = LIST_NAME_SAVE_ON_YOUR_CHOSEN_INTERMINE,
#   widget = "go_enrichment_for_gene"
#   organism = "Homo sapiens" # optional if results from more than one organism are retrieved
# )

GOEnrichmentResult = doEnrichment(
im = im,
genelist = "PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes",
widget = "go_enrichment_for_gene"
)

# Print the results
GOEnrichmentResult

$data A data.frame: 9 × 5 identifierdescriptionpValuecountpopulationAnnotationCount <chr><chr><chr><chr><chr> GO:0086010membrane depolarization during action potential0.0029933397620292668336 GO:0019226transmission of nerve impulse 0.012440595908190567 372 GO:0051899membrane depolarization 0.013711012061119023 385 GO:0019227neuronal action potential propagation 0.015706289435420368 29 GO:0098870action potential propagation 0.015706289435420368 29 GO:0042391regulation of membrane potential 0.015721651039763786 4421 GO:0001508action potential 0.019257259325613202 3126 GO:0035725sodium ion transmembrane transport 0.022633164613943716 3139 GO:0050877nervous system process 0.03723962575587324 51365$populationCount
16597
$notAnalysed 0$im
$mine HumanMine: 'http://www.humanmine.org/humanmine'$token
''
$parameters genelist 'PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes' widget 'go_enrichment_for_gene' maxp '0.05' correction 'Benjamini Hochberg' ## Visualise it with Gene Answers¶ For the visualisation, we'll need both the enrichment results and the list of genes we originally enriched - but the list is still on the server. Let's run a query to get the primaryIdentifiers for the genes in PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes. In [8]: # We'll be using the setQuery method here # # Syntax: # setQuery( # select = c("VIEW.NAME1", "VIEW.NAME2", .... ), # where = setConstraints( # paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"), # operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"), # values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B") # ) #) queryEpilepsy <- setQuery( # Choose which columns of data we'd like to see select = c("Gene.primaryIdentifier"), where = setConstraints( paths = c("Gene"), operators = c("IN"), values = list("PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes")) ) queryEpilepsyResults <- runQuery(im, queryEpilepsy) queryEpilepsyResults  A data.frame: 6 × 1 Gene.primaryIdentifier <chr> 2566 57094 6323 6324 6335 84059 ### Format data for GeneAnswers¶ We'll be using GeneAnswers to generate some visualisations of the data we've just fetched, so we'll need to convert our InterMine data into a shape that GeneAnswers can read. Luckily, there's a nice method for that in InterMineR: convertToGeneAnswers. In [15]: # load GeneAnswers package library(GeneAnswers) # convert InterMineR Gene Ontology Enrichment analysis results to GeneAnswers object queryEpilepsyForGeneAnswers = convertToGeneAnswers( # Pass the enrichment results we ran earlier to GeneAnswers: enrichmentResult = GOEnrichmentResult, # Pass in our original list of gene IDs geneInput = data.frame(GeneID = queryEpilepsyResults, stringsAsFactors = FALSE), # in our example we use Gene.primaryIdentifier values geneInputType = "Gene.primaryIdentifier", # The type of enrichment widget we ran # in our example we use Gene Ontology (GO) terms: categoryType = "GO" )  We can use the summary(geneAnswersObjectHere) function to see interesting info about our list + enrichment.... In [10]: summary(queryEpilepsyForGeneAnswers)  This GeneAnswers instance was build from GO based on hyperG test. Statistical information of 9 categories with p value less than 0.05 are reported. Other categories are considered as nonsignificant. There are 9 categories related to the given 6 genes Summary of GeneAnswers instance information: Slot: geneInput Gene.primaryIdentifier 1 2566 2 57094 3 6323 4 6324 5 6335 6 84059 Slot: testType [1] "hyperG" Slot: pvalueT [1] 0.05 Slot: genesInCategory$GO:0086010
[1] "6323" "6324" "6335"

$GO:0019226 [1] "6323" "6324" "6335"$GO:0051899
[1] "6323" "6324" "6335"

$GO:0019227 [1] "6323" "6324"$GO:0098870
[1] "6323" "6324"

\$GO:0042391
[1] "2566" "6323" "6324" "6335"

......

Slot: geneExprProfile
NULL

Slot: annLib
[1] "org.Hs.eg.db"

Slot: categoryType
[1] "GO"

Slot: enrichmentInfo
genes in Category percent in the observed List percent in the genome
GO:0086010                 3                    0.5000000          0.0021690667
GO:0019226                 3                    0.5000000          0.0043381334
GO:0051899                 3                    0.5000000          0.0051214075
GO:0019227                 2                    0.3333333          0.0005422667
GO:0098870                 2                    0.3333333          0.0005422667
GO:0042391                 4                    0.6666667          0.0253660300
fold of overrepresents odds ratio    p value fdr p value
GO:0086010              230.51389  460.02778 0.00299334  0.00299334
GO:0019226              115.25694  229.51389 0.01244060  0.01244060
GO:0051899               97.62941  194.25882 0.01371101  0.01371101
GO:0019227              614.70370  921.55556 0.01570629  0.01570629
GO:0098870              614.70370  921.55556 0.01570629  0.01570629
GO:0042391               26.28187   76.84561 0.01572165  0.01572165
......


### It's time for some graphs¶

We'll start with a concept-gene network that shows which genes are associated with our given GO terms

In [16]:
geneAnswersConceptNet(queryEpilepsyForGeneAnswers,
colorValueColumn=NULL,
centroidSize='correctedPvalue',
output='fixed',
geneSymbol = TRUE)


#### GO structure network¶

We can plot the GO terms that our list is enriched for to show their relationships to one another:

In [12]:
geneAnswersConceptRelation(queryEpilepsyForGeneAnswers,
directed=TRUE,
output="fixed",
netMode='connection')

[1] "Search Tree Merge without filterGraphIDs!"
[1] "Searching hubs ..."

Loading required package: GO.db


[1] "The Graph without any nodes removal is not a connected one. Search 4 layer(s). Drawing network ..."
[1] "Building graph structure ..."
[1] "Removing NULL element ..."
[1] "Removing NA ..., including NA in names of the input list"
[1] "Converting to a matrix ..."
[1] "For the given directed graph, the node 1 or 2 or 3 might be the root."


#### Gene Interaction Network¶

In this example, GeneAnswers uses the annotation package ‘org.Hs.eg.db’, which was used previously in convertToGeneAnswers function, to retrieve gene interactions for the specified genes.

In [13]:
buildNet(getGeneInput(queryEpilepsyForGeneAnswers)[,1],
idType='GeneInteraction',
layers=3,
output="fixed",
filterLayer=2,
netMode='connection')

[1] "Building graph structure ..."
[1] "Removing NULL element ..."
[1] "Removing NA ..., including NA in names of the input list"
[1] "Converting to a matrix ..."
[1] "For the given undirected graph, the node 3 might be the root."


Gene Interaction Network for PL_GenomicsEngland_GenePanel:Genetic_Epilepsy_Syndromes.

The large black dots with yellow frame stand for the six given genes. They also connect to other genes by dark-blue-purple edges. Small black dots represent the other genes from getGeneInput. Small white dots are genes that are not in the genes from getGeneInput, but interact with these genes.

#### Genes vs GO Term heat map¶

Let's plot the genes which were enriched for GO terms in a heat map...

In [14]:
geneAnswersHeatmap(queryEpilepsyForGeneAnswers,
catTerm=TRUE,
geneSymbol=TRUE,
cex.axis = 0.75)

initial  value 0.000000
final  value 0.000000
converged
initial  value 0.000000
final  value 0.000000
converged