PEB Belgrade - Bioconductor workshop

Giovanni M. Dall'Olio, GlaxoSmithKline. http://bioinfoblog.it

Welcome to the Bioconductor / Data Integration workshop.

This workshop is heavily inspired by the Coursera Bioconductor course. See here for materials: http://kasperdanielhansen.github.io/genbioconductor/

Requirements

Course materials: http://dalloliogm.github.io

This workshop requires several bioconductor libraries, which take a while to install.

Please start their installation by copying&pasting the commands below. We'll continue the lecture while they get installed:

# bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Homo.sapiens")
biocLite("rtracklayer")
biocLite("DOSE")
biocLite("clusterProfiler")
biocLite("AnnotationHub")

# dplyr
install.packages(c("dplyr"))

What is Bioconductor?

Bioconductor is the biggest repository of bioinformatics libraries for R. It contains code for dealing with genomic coordinates, calculating differential expression, converting gene symbols and ids, and much more.

To be included in bioconductor, a library must:

  • have documentation in the form of a vignette()
  • include a battery of tests to make sure that each new version is working
  • be citable, using citation() and possibly be described in a bioinformatics journal

How to get Bioconductor Packages

First, run the following:

source("http://bioconductor.org/biocLite.R")

At this point you can run the biocLite function to install any package available at http://bioconductor.org/packages/release/BiocViews.html#___Software

For example:

biocLite("limma")

will install the library limma (a popular library for diff. expression analysis)

Tip: The next time you run R, you can use library(BiocInstaller) instead of source("http://bioconductor.org/biocLite.R")

How to get help on a Bioconductor package

Overview of a package:

library(help=Homo.sapiens)
ls("package:Homo.sapiens")

Vignettes

library(Homo.sapiens)
vignette()

Given a function from a Bioconductor package, get a list of all the objects objects to which it can be applied:

showMethods("findOverlaps")

The Annotation packages in Bioconductor

Apart from analysis functions, Bioconductor contains several data packages (https://www.bioconductor.org/packages/release/data/annotation/), assembled from datasets of common interest for multiple organisms.

Some examples:

  • TxDB objects: coordinates for genes, transcripts, exons...
  • BSGenome: genome sequences
  • microarray ids (e.g. hgu133): conversions probe to genes for Affymetrix and Illumina arrays
  • org.*.eg.db: gene symbol to id conversion (entrez, ensembl, GO, ..)

In addition two packages allow to access large dataset repositories:

  • biomaRt: any biomart installation, e.g. ensembl, hgnc, (see http://www.biomart.org/)
  • AnnotationHub: access to several resources, e.g. any track in the UCSC browser, and more

In this tutorial we will see some of these (TxDB, org.eg.db, AnnotationHub).

Are Bioconductor Annotation packages up to date?

Most data packages are annotated regularly, with different frequency depending on the data type, from 6 months to 2 years.

This is a good compromise between reproducibility of analysis, efforts needed to maintain the packages, and getting recent data.

My species is not present or the data is outdated.

If your species is not present or you need more recent data, you can create a package on your own.

The procedure is well documented and usually require just a table as input. It's an easy way to become a Bioconductor developer and good for CV :-)

The Homo.sapiens package

Let's load the Homo.sapiens package. You will see that it will load several other packages:

In [19]:
library(Homo.sapiens)
library(help="Homo.sapiens")

Bioconductor contains similar wrapper packages for the most common model species (e.g. mouse, rat).

Gene symbols and IDs: the org.Hs.eg.db package

The org.*.eg.db packages allow to retrieve gene symbols and ids relative to a species (see list of all packages). The data is updated every two years, which is a good compromise between reproducibility and getting recent data.

If no org.eg.db package is available for your species, you may consider creating one yourself. You only require a table of genes/symbols, and following a tutorial

To see which data is included in the human package, we can open its help page:

library(help=org.Hs.eg.db)

In alternative, we can use the columns() function:

In [20]:
columns(org.Hs.eg.db)
  1. 'ENTREZID'
  2. 'PFAM'
  3. 'IPI'
  4. 'PROSITE'
  5. 'ACCNUM'
  6. 'ALIAS'
  7. 'CHR'
  8. 'CHRLOC'
  9. 'CHRLOCEND'
  10. 'ENZYME'
  11. 'MAP'
  12. 'PATH'
  13. 'PMID'
  14. 'REFSEQ'
  15. 'SYMBOL'
  16. 'UNIGENE'
  17. 'ENSEMBL'
  18. 'ENSEMBLPROT'
  19. 'ENSEMBLTRANS'
  20. 'GENENAME'
  21. 'UNIPROT'
  22. 'GO'
  23. 'EVIDENCE'
  24. 'ONTOLOGY'
  25. 'GOALL'
  26. 'EVIDENCEALL'
  27. 'ONTOLOGYALL'
  28. 'OMIM'
  29. 'UCSCKG'

This means that the org.Hs.eg.db package contains mapping between Entrez IDs, PFAM, Prosite, Genenames, GO, etc... for all human genes.

Visualizing a bimap as a table

The BiMaps in a library can also be accessed as variable. For example, org.Hs.eg.db contains many variables starting with "org.Hs.eg":

In [21]:
ls("package:org.Hs.eg.db")
  1. 'org.Hs.eg'
  2. 'org.Hs.egACCNUM'
  3. 'org.Hs.egACCNUM2EG'
  4. 'org.Hs.egALIAS2EG'
  5. 'org.Hs.egCHR'
  6. 'org.Hs.egCHRLENGTHS'
  7. 'org.Hs.egCHRLOC'
  8. 'org.Hs.egCHRLOCEND'
  9. 'org.Hs.eg.db'
  10. 'org.Hs.eg_dbconn'
  11. 'org.Hs.eg_dbfile'
  12. 'org.Hs.eg_dbInfo'
  13. 'org.Hs.eg_dbschema'
  14. 'org.Hs.egENSEMBL'
  15. 'org.Hs.egENSEMBL2EG'
  16. 'org.Hs.egENSEMBLPROT'
  17. 'org.Hs.egENSEMBLPROT2EG'
  18. 'org.Hs.egENSEMBLTRANS'
  19. 'org.Hs.egENSEMBLTRANS2EG'
  20. 'org.Hs.egENZYME'
  21. 'org.Hs.egENZYME2EG'
  22. 'org.Hs.egGENENAME'
  23. 'org.Hs.egGO'
  24. 'org.Hs.egGO2ALLEGS'
  25. 'org.Hs.egGO2EG'
  26. 'org.Hs.egMAP'
  27. 'org.Hs.egMAP2EG'
  28. 'org.Hs.egMAPCOUNTS'
  29. 'org.Hs.egOMIM'
  30. 'org.Hs.egOMIM2EG'
  31. 'org.Hs.egORGANISM'
  32. 'org.Hs.egPATH'
  33. 'org.Hs.egPATH2EG'
  34. 'org.Hs.egPFAM'
  35. 'org.Hs.egPMID'
  36. 'org.Hs.egPMID2EG'
  37. 'org.Hs.egPROSITE'
  38. 'org.Hs.egREFSEQ'
  39. 'org.Hs.egREFSEQ2EG'
  40. 'org.Hs.egSYMBOL'
  41. 'org.Hs.egSYMBOL2EG'
  42. 'org.Hs.egUCSCKG'
  43. 'org.Hs.egUNIGENE'
  44. 'org.Hs.egUNIGENE2EG'
  45. 'org.Hs.egUNIPROT'

These can be converted to table using as.data.frame or toTable. For example the following gets a dataframe of all KEGG pathways to Entrez:

In [22]:
head(toTable(org.Hs.egPATH2EG))
gene_idpath_id
2 04610
9 00232
9 00983
9 01100
10 00232
10 00983

To get the name and description of these KEGG pathways, we would have to load the KEGG.db data package.

These packages also allows to get more documentation:

?org.Hs.egPATH2EG

Entrez and symbols

The basic way to get data from these BiMap objects is the select() function from AnnotationDbi.

For example, here is how to get entrez and description of genes MGAT2 and MGAT3:

In [23]:
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("MGAT2", "MGAT3")), keytype='SYMBOL', columns=c('ENTREZID','GENENAME'))
SYMBOLENTREZIDGENENAME
MGAT2 4247 mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase
MGAT3 4248 mannosyl (beta-1,4-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase

The "keys" argument defines the symbols that we want to search. The "keytype" and "columns" define the type of input and output. Use the functions keytypes(org.Hs.eg.db) and columns(org.Hs.eg.db) to see which values are supported.

Exercise 1: symbol and Ensembl

What is the name and the Ensembl ID of the gene with Entrez ID 1234?

In [24]:
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("1234")), keytype='ENTREZID', columns=c('SYMBOL','ENSEMBL'))
ENTREZIDSYMBOLENSEMBL
1234 CCR5 ENSG00000160791

Exercise 2- Gene Ontology

The Gene Ontology database annotates terms related to the biological process, molecular function, and cellular compartment for every gene.

Can you get all the Gene Ontology (GO) terms associated to PTEN?

In [25]:
head(select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO'))
SYMBOLGOEVIDENCEONTOLOGY
PTEN GO:0000079TAS BP
PTEN GO:0000287IEA MF
PTEN GO:0001525IEA BP
PTEN GO:0001933IDA BP
PTEN GO:0002902IEA BP
PTEN GO:0004438IDA MF

To get the definition of these GO ids, we can use the GO.db database:

In [26]:
PTEN.go = select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO')
PTEN.go$TERM = AnnotationDbi::select(GO.db, keys=as.character(PTEN.go$GO), columns="TERM")$TERM
head(PTEN.go)
SYMBOLGOEVIDENCEONTOLOGYTERM
PTEN GO:0000079 TAS BP regulation of cyclin-dependent protein serine/threonine kinase activity
PTEN GO:0000287 IEA MF magnesium ion binding
PTEN GO:0001525 IEA BP angiogenesis
PTEN GO:0001933 IDA BP negative regulation of protein phosphorylation
PTEN GO:0002902 IEA BP regulation of B cell apoptotic process
PTEN GO:0004438 IDA MF phosphatidylinositol-3-phosphatase activity

Gene Ontology Enrichment with DOSE and clusterProfiler

When we have a long list of genes, listing all the Gene Ontology terms associated to each gene produces a very verbose output, which is difficult to use.

It is more useful to do an Ontology Enrichment analysis, to see which are the most represented terms in the list.

There are several functions to do enrichment analysis in BiocConductor, but today we will use the enrichGO function from clusterProfiler:

In [27]:
library(clusterProfiler)
library(DOSE)

mygenes = c(23450, 5160, 7126, 27118, 8452, 3675, 6624, 3638, 9918, 123, 213, 3383,1869,890,1871,9134,9928,1029,3479,6573,3757,332)
print(mygenes[1:10])
 [1] 23450  5160  7126 27118  8452  3675  6624  3638  9918   123
In [28]:
# Let's calculate the enrichment
myenrich = enrichGO(mygenes, "human", ont="MF", readable=F)
head(summary(myenrich))
IDDescriptionGeneRatioBgRatiopvaluep.adjustqvaluegeneIDCount
GO:0019899GO:0019899 enzyme binding 9/20 1481/18585 9.460737e-06 0.001712393 0.001145247 1029/213/332/3675/3757/7126/8452/890/9928 9
GO:0005178GO:0005178 integrin binding 3/20 104/18585 1.810737e-04 0.009349881 0.006253192 3383/3479/3675 3
GO:0032403GO:0032403 protein complex binding 6/20 852/18585 2.032337e-04 0.009349881 0.006253192 332/3383/3479/3675/6624/9928 6
GO:0016538GO:0016538 cyclin-dependent protein serine/threonine kinase regulator activity2/20 20/18585 2.066272e-04 0.009349881 0.006253192 1029/9134 2
GO:0019904GO:0019904 protein domain specific binding 5/20 588/18585 3.255709e-04 0.011785666 0.007882242 3675/3757/7126/8452/9928 5
GO:0050839GO:0050839 cell adhesion molecule binding 3/20 171/18585 7.775958e-04 0.023457474 0.015688337 3383/3479/3675 3

Behind the lines, enrichGO retrieves the GO annotations from the org.Hs.eg.db package, and calculates the enrichment.

The output can be plot in several ways. For example as a graph, using cnetplot:

In [29]:
cnetplot(myenrich)

Enrichment using other ontologies

We can also do an enrichment using other ontologies.

What is an Ontology?

An ontology is a controlled dictionary for a category of terms. They allow to avoid ambigueties and misunderstandings, by defining the standard name of a term (e.g. epilepsy) and annotating all the possible synonyms (e.g. epilepsy syndrome)

In bioinformatics the most famous ontology is GO, but there are many others. For example MeSH annotates medical terms, and Disease Ontology (subset of MeSH) annotates disease names.

When an Ontology is provided with mappings between terms and genes, we can use it to calculate enrichment. For example, we can use Disease Ontology to see which diseases are enriched in our list:

In [30]:
library(DOSE)
myenrich.do = enrichDO(mygenes)
head(summary(myenrich.do))
IDDescriptionGeneRatioBgRatiopvaluep.adjustqvaluegeneIDCount
DOID:0060116DOID:0060116 sensory system cancer 11/18 139/8018 8.183720e-16 2.168686e-13 8.011431e-14 3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332 11
DOID:2174DOID:2174 ocular cancer 11/18 139/8018 8.183720e-16 2.168686e-13 8.011431e-14 3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332 11
DOID:768DOID:768 retinoblastoma 10/18 116/8018 1.076659e-14 1.360963e-12 5.027589e-13 1869/890/1871/9134/9928/1029/3479/6573/3757/332 10
DOID:771DOID:771 retinal cell cancer 10/18 116/8018 1.076659e-14 1.360963e-12 5.027589e-13 1869/890/1871/9134/9928/1029/3479/6573/3757/332 10
DOID:4645DOID:4645 retinal cancer 10/18 118/8018 1.283927e-14 1.360963e-12 5.027589e-13 1869/890/1871/9134/9928/1029/3479/6573/3757/332 10
DOID:5679DOID:5679 retinal disease 12/18 455/8018 1.316791e-11 1.163165e-09 4.296896e-10 123/3383/1869/890/1871/9134/9928/1029/3479/6573/3757/33212

You can also plot the output of the enrichment.

NOTE: due to a bug in the current version of clusterProfiler, the following plot does not currently work. (will try to fix it next week)

In [32]:
plot(summary(myenrich.do))
Warning message in data.matrix(x):
“NAs introduced by coercion”Warning message in data.matrix(x):
“NAs introduced by coercion”Warning message in min(x):
“no non-missing arguments to min; returning Inf”Warning message in max(x):
“no non-missing arguments to max; returning -Inf”Warning message in min(x):
“no non-missing arguments to min; returning Inf”Warning message in max(x):
“no non-missing arguments to max; returning -Inf”
Error in plot.window(...): need finite 'xlim' values
Traceback:

1. plot(summary(myenrich.do))
2. plot(summary(myenrich.do))
3. plot.data.frame(summary(myenrich.do))
4. pairs(data.matrix(x), ...)
5. pairs.default(data.matrix(x), ...)
6. localPlot(x[, j], x[, i], xlab = "", ylab = "", axes = FALSE, 
 .     type = "n", ..., log = l)
7. plot(...)
8. plot.default(...)
9. localWindow(xlim, ylim, log, asp, ...)
10. plot.window(...)

Getting gene coordinates: the TxDB packages

Another package loaded with Homo.sapiens is a TxDb object, named TxDb.Hsapiens.UCSC.hg19.knownGene.

This contains coordinates of genes, transcripts and exons in the human genome.

We can get the gene coordinates using the genes() function:

In [33]:
genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand   |     gene_id
           <Rle>              <IRanges>  <Rle>   | <character>
      1    chr19 [ 58858172,  58874214]      -   |           1
     10     chr8 [ 18248755,  18258723]      +   |          10
    100    chr20 [ 43248163,  43280376]      -   |         100
   1000    chr18 [ 25530930,  25757445]      -   |        1000
  10000     chr1 [243651535, 244006886]      -   |       10000
    ...      ...                    ...    ... ...         ...
   9991     chr9 [114979995, 115095944]      -   |        9991
   9992    chr21 [ 35736323,  35743440]      +   |        9992
   9993    chr22 [ 19023795,  19109967]      -   |        9993
   9994     chr6 [ 90539619,  90584155]      +   |        9994
   9997    chr22 [ 50961997,  50964905]      -   |        9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Getting promoters, cds, transcripts and exons

Apart from genes, we can extract coordinates of transcripts(), exons(), cds(), and promoters(), using the homonymous function.

The output of these functions can be customized to include additional columns, e.g. you may want to print both the exon and gene id:

In [34]:
exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
GRanges object with 289969 ranges and 1 metadata column:
           seqnames               ranges strand   |   exon_id
              <Rle>            <IRanges>  <Rle>   | <integer>
       [1]     chr1       [11874, 12227]      +   |         1
       [2]     chr1       [12595, 12721]      +   |         2
       [3]     chr1       [12613, 12721]      +   |         3
       [4]     chr1       [12646, 12697]      +   |         4
       [5]     chr1       [13221, 14409]      +   |         5
       ...      ...                  ...    ... ...       ...
  [289965]     chrY [27607404, 27607432]      -   |    277746
  [289966]     chrY [27635919, 27635954]      -   |    277747
  [289967]     chrY [59358329, 59359508]      -   |    277748
  [289968]     chrY [59360007, 59360115]      -   |    277749
  [289969]     chrY [59360501, 59360854]      -   |    277750
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
In [35]:
exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c("exon_id", "gene_id"))
GRanges object with 289969 ranges and 2 metadata columns:
           seqnames               ranges strand   |   exon_id         gene_id
              <Rle>            <IRanges>  <Rle>   | <integer> <CharacterList>
       [1]     chr1       [11874, 12227]      +   |         1       100287102
       [2]     chr1       [12595, 12721]      +   |         2       100287102
       [3]     chr1       [12613, 12721]      +   |         3       100287102
       [4]     chr1       [12646, 12697]      +   |         4       100287102
       [5]     chr1       [13221, 14409]      +   |         5       100287102
       ...      ...                  ...    ... ...       ...             ...
  [289965]     chrY [27607404, 27607432]      -   |    277746              NA
  [289966]     chrY [27635919, 27635954]      -   |    277747              NA
  [289967]     chrY [59358329, 59359508]      -   |    277748              NA
  [289968]     chrY [59360007, 59360115]      -   |    277749              NA
  [289969]     chrY [59360501, 59360854]      -   |    277750              NA
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

To get the list of all possible additional columns, we can use the columns() function as for the org.Hs.eg.db package:

In [36]:
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)
  1. 'CDSID'
  2. 'CDSNAME'
  3. 'CDSCHROM'
  4. 'CDSSTRAND'
  5. 'CDSSTART'
  6. 'CDSEND'
  7. 'EXONID'
  8. 'EXONNAME'
  9. 'EXONCHROM'
  10. 'EXONSTRAND'
  11. 'EXONSTART'
  12. 'EXONEND'
  13. 'GENEID'
  14. 'TXID'
  15. 'EXONRANK'
  16. 'TXNAME'
  17. 'TXTYPE'
  18. 'TXCHROM'
  19. 'TXSTRAND'
  20. 'TXSTART'
  21. 'TXEND'

Subsetting a GenomicRanges object

GenomicRanges object can be subsetted like normal data frames, using the subset function.

Chromosome names must be referred as "seqnames":

In [37]:
subset(exons(TxDb.Hsapiens.UCSC.hg19.knownGene), 
       seqnames=="chr6" & start>10000000 & end<20000000)
GRanges object with 569 ranges and 1 metadata column:
        seqnames               ranges strand   |   exon_id
           <Rle>            <IRanges>  <Rle>   | <integer>
    [1]     chr6 [10412551, 10412600]      +   |     86074
    [2]     chr6 [10414300, 10415284]      +   |     86075
    [3]     chr6 [10414914, 10415284]      +   |     86076
    [4]     chr6 [10415520, 10415613]      +   |     86077
    [5]     chr6 [10415520, 10416402]      +   |     86078
    ...      ...                  ...    ... ...       ...
  [565]     chr6 [19180383, 19180711]      -   |     93064
  [566]     chr6 [19642656, 19642760]      -   |     93065
  [567]     chr6 [19802395, 19804141]      -   |     93066
  [568]     chr6 [19804402, 19804508]      -   |     93067
  [569]     chr6 [19804906, 19804981]      -   |     93068
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Importing a BED file

BED and other formats (GFF, bigwig, etc..) can be imported using the import() function from rtracklayer.

Let's import a simple BED file:

In [38]:
library(rtracklayer)
marks = import("https://raw.githubusercontent.com/dalloliogm/belgrade_unix_intro/master/data/genes/example.bed",
      format="bed")
marks
GRanges object with 2 ranges and 2 metadata columns:
      seqnames                 ranges strand |        name     score
         <Rle>              <IRanges>  <Rle> | <character> <numeric>
  [1]     chr8 [145701383, 145702686]      + |       mark1        12
  [2]    chr20 [ 33850820,  33852880]      - |       mark2        52
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

If you have trouble downloading the file, just copy these contents and save it to a file called example.bed:

chr8    145701382    145702686    mark1    12    +
chr20    33850819    33852880    mark2    52    -

Operations on GenomicRanges: nearest

The BED file just imported contains coordinates of two regions.

We can use the nearest() function to find the genes that are closer to these regions:

In [39]:
human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
closest.genes = human.genes[nearest(marks, human.genes)]
closest.genes$mark = marks$name
closest.genes
GRanges object with 2 ranges and 2 metadata columns:
        seqnames                 ranges strand |     gene_id        mark
           <Rle>              <IRanges>  <Rle> | <character> <character>
  90990     chr8 [145691738, 145699499]      + |       90990       mark1
  55741    chr20 [ 33703160,  33865960]      - |       55741       mark2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Annotation Hub

AnnotationHub is a recently introduced library, which allows to download data from several sources, directly into R:

In [40]:
library(AnnotationHub)
ahub = AnnotationHub()
ahub
AnnotationHub with 35306 records
# snapshotDate(): 2016-08-15 
# $dataprovider: BroadInstitute, UCSC, Ensembl, NCBI, Haemcode, Inparanoid8,...
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
# $rdataclass: GRanges, FaFile, BigWigFile, OrgDb, ChainFile, Inparanoid8Db,...
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH2"]]' 

            title                                               
  AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
  AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
  AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
  AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
  AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
  ...       ...                                                 
  AH49436 | Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa
  AH49437 | Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.fa
  AH49438 | Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa   
  AH49439 | Xiphophorus_maculatus.Xipmac4.4.2.ncrna.fa          
  AH49440 | Xiphophorus_maculatus.Xipmac4.4.2.pep.all.fa        

Getting data with AnnotationHub

For an overview of the data available via AnnotationHub, we can run the display function:

ahub = AnnotationHub()
display(ahub)

In alternative, we can use the subset function:

In [41]:
subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC")
AnnotationHub with 5958 records
# snapshotDate(): 2016-08-15 
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges, ChainFile, TwoBitFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

            title                                             
  AH5012  | Chromosome Band                                   
  AH5013  | STS Markers                                       
  AH5014  | FISH Clones                                       
  AH5015  | Recomb Rate                                       
  AH5016  | ENCODE Pilot                                      
  ...       ...                                               
  AH27618 | wgEncodeUwTfbsWerirb1CtcfStdPkRep2.narrowPeak.gz  
  AH27619 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep1.broadPeak.gz
  AH27620 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep2.broadPeak.gz
  AH27621 | wgEncodeUwTfbsWi38CtcfStdPkRep1.narrowPeak.gz     
  AH27622 | wgEncodeUwTfbsWi38CtcfStdPkRep2.narrowPeak.gz     

For more complex queries, we can also use query():

In [42]:
query(subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC"), "Most cons")
AnnotationHub with 3 records
# snapshotDate(): 2016-08-15 
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH5224"]]' 

           title           
  AH5224 | 28-Way Most Cons
  AH5225 | 17-Way Most Cons
  AH5354 | Most Conserved  

Download a dataset from AnnotationHub

To download a dataset, we can use double square brackets, and the id of the dataset:

In [43]:
multi.24 = ahub[["AH5224"]]
multi.24
GRanges object with 2040420 ranges and 2 metadata columns:
                 seqnames               ranges strand   |        name     score
                    <Rle>            <IRanges>  <Rle>   | <character> <numeric>
        [1]          chr1 [25165820, 25165859]      *   |      lod=22       299
        [2]          chr1 [28311541, 28311557]      *   |      lod=14       244
        [3]          chr1 [29360118, 29360136]      *   |      lod=14       244
        [4]          chr1 [71303152, 71303427]      *   |     lod=462       674
        [5]          chr1 [88080377, 88080627]      *   |     lod=485       680
        ...           ...                  ...    ... ...         ...       ...
  [2040416] chr6_qbl_hap2   [4564430, 4564669]      *   |      lod=40       373
  [2040417] chr6_qbl_hap2   [4564824, 4564899]      *   |      lod=26       320
  [2040418] chr6_qbl_hap2   [4564975, 4565246]      *   |      lod=93       477
  [2040419] chr6_qbl_hap2   [4565271, 4565538]      *   |     lod=102       488
  [2040420] chr6_qbl_hap2   [4565616, 4565661]      *   |      lod=15       252
  -------
  seqinfo: 49 sequences from hg18 genome

Getting conservation scores for custom intervals

Now that we have conservation scores from the UCSC-Multiz table, we can get the scores for the marks bed file we imported earlier:

In [44]:
subsetByOverlaps(multi.24, marks)
GRanges object with 10 ranges and 2 metadata columns:
       seqnames                 ranges strand |        name     score
          <Rle>              <IRanges>  <Rle> | <character> <numeric>
   [1]     chr8 [145701434, 145701456]      * |      lod=19       281
   [2]     chr8 [145701503, 145701510]      * |      lod=13       235
   [3]     chr8 [145701542, 145701585]      * |      lod=35       357
   [4]     chr8 [145701653, 145701678]      * |      lod=24       310
   [5]     chr8 [145702037, 145702121]      * |      lod=36       360
   [6]     chr8 [145702184, 145702262]      * |      lod=65       433
   [7]     chr8 [145702272, 145702329]      * |      lod=36       360
   [8]     chr8 [145702421, 145702512]      * |      lod=46       390
   [9]    chr20 [ 33852458,  33852471]      * |      lod=15       252
  [10]    chr20 [ 33852712,  33852970]      * |     lod=428       665
  -------
  seqinfo: 49 sequences from hg18 genome

Another useful function is mergeByOverlaps, which combines information from both ranges:

In [45]:
mergeByOverlaps(marks, multi.24)
DataFrame with 10 rows and 6 columns
                            marks        name     score
                        <GRanges> <character> <numeric>
1   chr8:+:[145701383, 145702686]       mark1        12
2   chr8:+:[145701383, 145702686]       mark1        12
3   chr8:+:[145701383, 145702686]       mark1        12
4   chr8:+:[145701383, 145702686]       mark1        12
5   chr8:+:[145701383, 145702686]       mark1        12
6   chr8:+:[145701383, 145702686]       mark1        12
7   chr8:+:[145701383, 145702686]       mark1        12
8   chr8:+:[145701383, 145702686]       mark1        12
9  chr20:-:[ 33850820,  33852880]       mark2        52
10 chr20:-:[ 33850820,  33852880]       mark2        52
                         multi.24      name.1   score.1
                        <GRanges> <character> <numeric>
1   chr8:*:[145701434, 145701456]      lod=19       281
2   chr8:*:[145701503, 145701510]      lod=13       235
3   chr8:*:[145701542, 145701585]      lod=35       357
4   chr8:*:[145701653, 145701678]      lod=24       310
5   chr8:*:[145702037, 145702121]      lod=36       360
6   chr8:*:[145702184, 145702262]      lod=65       433
7   chr8:*:[145702272, 145702329]      lod=36       360
8   chr8:*:[145702421, 145702512]      lod=46       390
9  chr20:*:[ 33852458,  33852471]      lod=15       252
10 chr20:*:[ 33852712,  33852970]     lod=428       665

This can be used to calculate the average conservation by region. In this case we will use the dplyr library for the aggregation:

In [47]:
library(dplyr)
select = AnnotationDbi::select

mergeByOverlaps(marks, multi.24) %>% as.data.frame %>% group_by(marks.seqnames) %>% summarise(mean=mean(score.1))
marks.seqnamesmean
chr8 340.75
chr20 458.50