Giovanni M. Dall'Olio, GlaxoSmithKline. http://bioinfoblog.it
Welcome to the Bioconductor / Data Integration workshop.
This workshop is heavily inspired by the Coursera Bioconductor course. See here for materials: http://kasperdanielhansen.github.io/genbioconductor/
Course materials: http://dalloliogm.github.io
This workshop requires several bioconductor libraries, which take a while to install.
Please start their installation by copying&pasting the commands below. We'll continue the lecture while they get installed:
# bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Homo.sapiens")
biocLite("rtracklayer")
biocLite("DOSE")
biocLite("clusterProfiler")
biocLite("AnnotationHub")
# dplyr
install.packages(c("dplyr"))
Bioconductor is the biggest repository of bioinformatics libraries for R. It contains code for dealing with genomic coordinates, calculating differential expression, converting gene symbols and ids, and much more.
To be included in bioconductor, a library must:
First, run the following:
source("http://bioconductor.org/biocLite.R")
At this point you can run the biocLite function to install any package available at http://bioconductor.org/packages/release/BiocViews.html#___Software
For example:
biocLite("limma")
will install the library limma (a popular library for diff. expression analysis)
Tip: The next time you run R, you can use library(BiocInstaller) instead of source("http://bioconductor.org/biocLite.R")
Overview of a package:
library(help=Homo.sapiens)
ls("package:Homo.sapiens")
Vignettes
library(Homo.sapiens)
vignette()
Given a function from a Bioconductor package, get a list of all the objects objects to which it can be applied:
showMethods("findOverlaps")
Apart from analysis functions, Bioconductor contains several data packages (https://www.bioconductor.org/packages/release/data/annotation/), assembled from datasets of common interest for multiple organisms.
Some examples:
In addition two packages allow to access large dataset repositories:
In this tutorial we will see some of these (TxDB, org.eg.db, AnnotationHub).
Most data packages are annotated regularly, with different frequency depending on the data type, from 6 months to 2 years.
This is a good compromise between reproducibility of analysis, efforts needed to maintain the packages, and getting recent data.
If your species is not present or you need more recent data, you can create a package on your own.
The procedure is well documented and usually require just a table as input. It's an easy way to become a Bioconductor developer and good for CV :-)
Let's load the Homo.sapiens package. You will see that it will load several other packages:
library(Homo.sapiens)
library(help="Homo.sapiens")
Bioconductor contains similar wrapper packages for the most common model species (e.g. mouse, rat).
The org.*.eg.db packages allow to retrieve gene symbols and ids relative to a species (see list of all packages). The data is updated every two years, which is a good compromise between reproducibility and getting recent data.
If no org.eg.db package is available for your species, you may consider creating one yourself. You only require a table of genes/symbols, and following a tutorial
To see which data is included in the human package, we can open its help page:
library(help=org.Hs.eg.db)
In alternative, we can use the columns() function:
columns(org.Hs.eg.db)
This means that the org.Hs.eg.db package contains mapping between Entrez IDs, PFAM, Prosite, Genenames, GO, etc... for all human genes.
The BiMaps in a library can also be accessed as variable. For example, org.Hs.eg.db contains many variables starting with "org.Hs.eg":
ls("package:org.Hs.eg.db")
These can be converted to table using as.data.frame or toTable. For example the following gets a dataframe of all KEGG pathways to Entrez:
head(toTable(org.Hs.egPATH2EG))
To get the name and description of these KEGG pathways, we would have to load the KEGG.db data package.
These packages also allows to get more documentation:
?org.Hs.egPATH2EG
The basic way to get data from these BiMap objects is the select() function from AnnotationDbi.
For example, here is how to get entrez and description of genes MGAT2 and MGAT3:
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("MGAT2", "MGAT3")), keytype='SYMBOL', columns=c('ENTREZID','GENENAME'))
The "keys" argument defines the symbols that we want to search. The "keytype" and "columns" define the type of input and output. Use the functions keytypes(org.Hs.eg.db) and columns(org.Hs.eg.db) to see which values are supported.
What is the name and the Ensembl ID of the gene with Entrez ID 1234?
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("1234")), keytype='ENTREZID', columns=c('SYMBOL','ENSEMBL'))
The Gene Ontology database annotates terms related to the biological process, molecular function, and cellular compartment for every gene.
Can you get all the Gene Ontology (GO) terms associated to PTEN?
head(select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO'))
To get the definition of these GO ids, we can use the GO.db database:
PTEN.go = select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO')
PTEN.go$TERM = AnnotationDbi::select(GO.db, keys=as.character(PTEN.go$GO), columns="TERM")$TERM
head(PTEN.go)
When we have a long list of genes, listing all the Gene Ontology terms associated to each gene produces a very verbose output, which is difficult to use.
It is more useful to do an Ontology Enrichment analysis, to see which are the most represented terms in the list.
There are several functions to do enrichment analysis in BiocConductor, but today we will use the enrichGO function from clusterProfiler:
library(clusterProfiler)
library(DOSE)
mygenes = c(23450, 5160, 7126, 27118, 8452, 3675, 6624, 3638, 9918, 123, 213, 3383,1869,890,1871,9134,9928,1029,3479,6573,3757,332)
print(mygenes[1:10])
# Let's calculate the enrichment
myenrich = enrichGO(mygenes, "human", ont="MF", readable=F)
head(summary(myenrich))
Behind the lines, enrichGO retrieves the GO annotations from the org.Hs.eg.db package, and calculates the enrichment.
The output can be plot in several ways. For example as a graph, using cnetplot:
cnetplot(myenrich)
We can also do an enrichment using other ontologies.
An ontology is a controlled dictionary for a category of terms. They allow to avoid ambigueties and misunderstandings, by defining the standard name of a term (e.g. epilepsy) and annotating all the possible synonyms (e.g. epilepsy syndrome)
In bioinformatics the most famous ontology is GO, but there are many others. For example MeSH annotates medical terms, and Disease Ontology (subset of MeSH) annotates disease names.
When an Ontology is provided with mappings between terms and genes, we can use it to calculate enrichment. For example, we can use Disease Ontology to see which diseases are enriched in our list:
library(DOSE)
myenrich.do = enrichDO(mygenes)
head(summary(myenrich.do))
You can also plot the output of the enrichment.
NOTE: due to a bug in the current version of clusterProfiler, the following plot does not currently work. (will try to fix it next week)
plot(summary(myenrich.do))
Another package loaded with Homo.sapiens is a TxDb object, named TxDb.Hsapiens.UCSC.hg19.knownGene.
This contains coordinates of genes, transcripts and exons in the human genome.
We can get the gene coordinates using the genes() function:
genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
Apart from genes, we can extract coordinates of transcripts(), exons(), cds(), and promoters(), using the homonymous function.
The output of these functions can be customized to include additional columns, e.g. you may want to print both the exon and gene id:
exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c("exon_id", "gene_id"))
To get the list of all possible additional columns, we can use the columns() function as for the org.Hs.eg.db package:
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)
GenomicRanges object can be subsetted like normal data frames, using the subset function.
Chromosome names must be referred as "seqnames":
subset(exons(TxDb.Hsapiens.UCSC.hg19.knownGene),
seqnames=="chr6" & start>10000000 & end<20000000)
BED and other formats (GFF, bigwig, etc..) can be imported using the import() function from rtracklayer.
Let's import a simple BED file:
library(rtracklayer)
marks = import("https://raw.githubusercontent.com/dalloliogm/belgrade_unix_intro/master/data/genes/example.bed",
format="bed")
marks
If you have trouble downloading the file, just copy these contents and save it to a file called example.bed:
chr8 145701382 145702686 mark1 12 +
chr20 33850819 33852880 mark2 52 -
The BED file just imported contains coordinates of two regions.
We can use the nearest() function to find the genes that are closer to these regions:
human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
closest.genes = human.genes[nearest(marks, human.genes)]
closest.genes$mark = marks$name
closest.genes
AnnotationHub is a recently introduced library, which allows to download data from several sources, directly into R:
library(AnnotationHub)
ahub = AnnotationHub()
ahub
For an overview of the data available via AnnotationHub, we can run the display function:
ahub = AnnotationHub()
display(ahub)
In alternative, we can use the subset function:
subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC")
For more complex queries, we can also use query():
query(subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC"), "Most cons")
To download a dataset, we can use double square brackets, and the id of the dataset:
multi.24 = ahub[["AH5224"]]
multi.24
Now that we have conservation scores from the UCSC-Multiz table, we can get the scores for the marks bed file we imported earlier:
subsetByOverlaps(multi.24, marks)
Another useful function is mergeByOverlaps, which combines information from both ranges:
mergeByOverlaps(marks, multi.24)
This can be used to calculate the average conservation by region. In this case we will use the dplyr library for the aggregation:
library(dplyr)
select = AnnotationDbi::select
mergeByOverlaps(marks, multi.24) %>% as.data.frame %>% group_by(marks.seqnames) %>% summarise(mean=mean(score.1))