PEB Belgrade - Bioconductor workshop¶

Giovanni M. Dall'Olio, GlaxoSmithKline. http://bioinfoblog.it

Welcome to the Bioconductor / Data Integration workshop.

This workshop is heavily inspired by the Coursera Bioconductor course. See here for materials: http://kasperdanielhansen.github.io/genbioconductor/

Requirements¶

Course materials: http://dalloliogm.github.io

This workshop requires several bioconductor libraries, which take a while to install.

Please start their installation by copying&pasting the commands below. We'll continue the lecture while they get installed:

# bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Homo.sapiens")
biocLite("rtracklayer")
biocLite("DOSE")
biocLite("clusterProfiler")
biocLite("AnnotationHub")

# dplyr
install.packages(c("dplyr"))

What is Bioconductor?¶

Bioconductor is the biggest repository of bioinformatics libraries for R. It contains code for dealing with genomic coordinates, calculating differential expression, converting gene symbols and ids, and much more.

To be included in bioconductor, a library must:

have documentation in the form of a vignette()
include a battery of tests to make sure that each new version is working
be citable, using citation() and possibly be described in a bioinformatics journal

How to get Bioconductor Packages¶

First, run the following:

source("http://bioconductor.org/biocLite.R")

At this point you can run the biocLite function to install any package available at http://bioconductor.org/packages/release/BiocViews.html#___Software

For example:

biocLite("limma")

will install the library limma (a popular library for diff. expression analysis)

Tip: The next time you run R, you can use library(BiocInstaller) instead of source("http://bioconductor.org/biocLite.R")

How to get help on a Bioconductor package¶

Overview of a package:

library(help=Homo.sapiens)
ls("package:Homo.sapiens")

Vignettes

library(Homo.sapiens)
vignette()

Given a function from a Bioconductor package, get a list of all the objects objects to which it can be applied:

showMethods("findOverlaps")

The Annotation packages in Bioconductor¶

Apart from analysis functions, Bioconductor contains several data packages (https://www.bioconductor.org/packages/release/data/annotation/), assembled from datasets of common interest for multiple organisms.

Some examples:

TxDB objects: coordinates for genes, transcripts, exons...
BSGenome: genome sequences
microarray ids (e.g. hgu133): conversions probe to genes for Affymetrix and Illumina arrays
**org.*.eg.db**: gene symbol to id conversion (entrez, ensembl, GO, ..)

In addition two packages allow to access large dataset repositories:

biomaRt: any biomart installation, e.g. ensembl, hgnc, (see http://www.biomart.org/)
AnnotationHub: access to several resources, e.g. any track in the UCSC browser, and more

In this tutorial we will see some of these (TxDB, org.eg.db, AnnotationHub).

Are Bioconductor Annotation packages up to date?¶

Most data packages are annotated regularly, with different frequency depending on the data type, from 6 months to 2 years.

This is a good compromise between reproducibility of analysis, efforts needed to maintain the packages, and getting recent data.

My species is not present or the data is outdated.¶

If your species is not present or you need more recent data, you can create a package on your own.

The procedure is well documented and usually require just a table as input. It's an easy way to become a Bioconductor developer and good for CV :-)

The Homo.sapiens package¶

Let's load the Homo.sapiens package. You will see that it will load several other packages:

In [19]:

library(Homo.sapiens)
library(help="Homo.sapiens")

Bioconductor contains similar wrapper packages for the most common model species (e.g. mouse, rat).

Gene symbols and IDs: the org.Hs.eg.db package¶

The org.*.eg.db packages allow to retrieve gene symbols and ids relative to a species (see list of all packages). The data is updated every two years, which is a good compromise between reproducibility and getting recent data.

If no org.eg.db package is available for your species, you may consider creating one yourself. You only require a table of genes/symbols, and following a tutorial

To see which data is included in the human package, we can open its help page:

library(help=org.Hs.eg.db)

In alternative, we can use the columns() function:

In [20]:

columns(org.Hs.eg.db)

'ENTREZID'
'PFAM'
'IPI'
'PROSITE'
'ACCNUM'
'ALIAS'
'CHR'
'CHRLOC'
'CHRLOCEND'
'ENZYME'
'MAP'
'PATH'
'PMID'
'REFSEQ'
'SYMBOL'
'UNIGENE'
'ENSEMBL'
'ENSEMBLPROT'
'ENSEMBLTRANS'
'GENENAME'
'UNIPROT'
'GO'
'EVIDENCE'
'ONTOLOGY'
'GOALL'
'EVIDENCEALL'
'ONTOLOGYALL'
'OMIM'
'UCSCKG'

This means that the org.Hs.eg.db package contains mapping between Entrez IDs, PFAM, Prosite, Genenames, GO, etc... for all human genes.

Visualizing a bimap as a table¶

The BiMaps in a library can also be accessed as variable. For example, org.Hs.eg.db contains many variables starting with "org.Hs.eg":

In [21]:

ls("package:org.Hs.eg.db")

'org.Hs.eg'
'org.Hs.egACCNUM'
'org.Hs.egACCNUM2EG'
'org.Hs.egALIAS2EG'
'org.Hs.egCHR'
'org.Hs.egCHRLENGTHS'
'org.Hs.egCHRLOC'
'org.Hs.egCHRLOCEND'
'org.Hs.eg.db'
'org.Hs.eg_dbconn'
'org.Hs.eg_dbfile'
'org.Hs.eg_dbInfo'
'org.Hs.eg_dbschema'
'org.Hs.egENSEMBL'
'org.Hs.egENSEMBL2EG'
'org.Hs.egENSEMBLPROT'
'org.Hs.egENSEMBLPROT2EG'
'org.Hs.egENSEMBLTRANS'
'org.Hs.egENSEMBLTRANS2EG'
'org.Hs.egENZYME'
'org.Hs.egENZYME2EG'
'org.Hs.egGENENAME'
'org.Hs.egGO'
'org.Hs.egGO2ALLEGS'
'org.Hs.egGO2EG'
'org.Hs.egMAP'
'org.Hs.egMAP2EG'
'org.Hs.egMAPCOUNTS'
'org.Hs.egOMIM'
'org.Hs.egOMIM2EG'
'org.Hs.egORGANISM'
'org.Hs.egPATH'
'org.Hs.egPATH2EG'
'org.Hs.egPFAM'
'org.Hs.egPMID'
'org.Hs.egPMID2EG'
'org.Hs.egPROSITE'
'org.Hs.egREFSEQ'
'org.Hs.egREFSEQ2EG'
'org.Hs.egSYMBOL'
'org.Hs.egSYMBOL2EG'
'org.Hs.egUCSCKG'
'org.Hs.egUNIGENE'
'org.Hs.egUNIGENE2EG'
'org.Hs.egUNIPROT'

These can be converted to table using as.data.frame or toTable. For example the following gets a dataframe of all KEGG pathways to Entrez:

In [22]:

head(toTable(org.Hs.egPATH2EG))

gene_id	path_id
2	04610
9	00232
9	00983
9	01100
10	00232
10	00983

To get the name and description of these KEGG pathways, we would have to load the KEGG.db data package.

These packages also allows to get more documentation:

?org.Hs.egPATH2EG

Entrez and symbols¶

The basic way to get data from these BiMap objects is the select() function from AnnotationDbi.

For example, here is how to get entrez and description of genes MGAT2 and MGAT3:

In [23]:

AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("MGAT2", "MGAT3")), keytype='SYMBOL', columns=c('ENTREZID','GENENAME'))

SYMBOL	ENTREZID	GENENAME
MGAT2	4247	mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase
MGAT3	4248	mannosyl (beta-1,4-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase

The "keys" argument defines the symbols that we want to search. The "keytype" and "columns" define the type of input and output. Use the functions keytypes(org.Hs.eg.db) and columns(org.Hs.eg.db) to see which values are supported.

Exercise 1: symbol and Ensembl¶

What is the name and the Ensembl ID of the gene with Entrez ID 1234?

In [24]:

AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("1234")), keytype='ENTREZID', columns=c('SYMBOL','ENSEMBL'))

ENTREZID	SYMBOL	ENSEMBL
1234	CCR5	ENSG00000160791

Exercise 2- Gene Ontology¶

The Gene Ontology database annotates terms related to the biological process, molecular function, and cellular compartment for every gene.

Can you get all the Gene Ontology (GO) terms associated to PTEN?

In [25]:

head(select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO'))

SYMBOL	GO	EVIDENCE	ONTOLOGY
PTEN	GO:0000079	TAS	BP
PTEN	GO:0000287	IEA	MF
PTEN	GO:0001525	IEA	BP
PTEN	GO:0001933	IDA	BP
PTEN	GO:0002902	IEA	BP
PTEN	GO:0004438	IDA	MF

To get the definition of these GO ids, we can use the GO.db database:

In [26]:

PTEN.go = select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO')
PTEN.go$TERM = AnnotationDbi::select(GO.db, keys=as.character(PTEN.go$GO), columns="TERM")$TERM
head(PTEN.go)

SYMBOL	GO	EVIDENCE	ONTOLOGY	TERM
PTEN	GO:0000079	TAS	BP	regulation of cyclin-dependent protein serine/threonine kinase activity
PTEN	GO:0000287	IEA	MF	magnesium ion binding
PTEN	GO:0001525	IEA	BP	angiogenesis
PTEN	GO:0001933	IDA	BP	negative regulation of protein phosphorylation
PTEN	GO:0002902	IEA	BP	regulation of B cell apoptotic process
PTEN	GO:0004438	IDA	MF	phosphatidylinositol-3-phosphatase activity

Gene Ontology Enrichment with DOSE and clusterProfiler¶

When we have a long list of genes, listing all the Gene Ontology terms associated to each gene produces a very verbose output, which is difficult to use.

It is more useful to do an Ontology Enrichment analysis, to see which are the most represented terms in the list.

There are several functions to do enrichment analysis in BiocConductor, but today we will use the enrichGO function from clusterProfiler:

In [27]:

library(clusterProfiler)
library(DOSE)

mygenes = c(23450, 5160, 7126, 27118, 8452, 3675, 6624, 3638, 9918, 123, 213, 3383,1869,890,1871,9134,9928,1029,3479,6573,3757,332)
print(mygenes[1:10])

 [1] 23450  5160  7126 27118  8452  3675  6624  3638  9918   123

In [28]:

# Let's calculate the enrichment
myenrich = enrichGO(mygenes, "human", ont="MF", readable=F)
head(summary(myenrich))

	ID	Description	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	geneID	Count
GO:0019899	GO:0019899	enzyme binding	9/20	1481/18585	9.460737e-06	0.001712393	0.001145247	1029/213/332/3675/3757/7126/8452/890/9928	9
GO:0005178	GO:0005178	integrin binding	3/20	104/18585	1.810737e-04	0.009349881	0.006253192	3383/3479/3675	3
GO:0032403	GO:0032403	protein complex binding	6/20	852/18585	2.032337e-04	0.009349881	0.006253192	332/3383/3479/3675/6624/9928	6
GO:0016538	GO:0016538	cyclin-dependent protein serine/threonine kinase regulator activity	2/20	20/18585	2.066272e-04	0.009349881	0.006253192	1029/9134	2
GO:0019904	GO:0019904	protein domain specific binding	5/20	588/18585	3.255709e-04	0.011785666	0.007882242	3675/3757/7126/8452/9928	5
GO:0050839	GO:0050839	cell adhesion molecule binding	3/20	171/18585	7.775958e-04	0.023457474	0.015688337	3383/3479/3675	3

Behind the lines, enrichGO retrieves the GO annotations from the org.Hs.eg.db package, and calculates the enrichment.

The output can be plot in several ways. For example as a graph, using cnetplot:

In [29]:

cnetplot(myenrich)

Enrichment using other ontologies¶

We can also do an enrichment using other ontologies.

What is an Ontology?¶

An ontology is a controlled dictionary for a category of terms. They allow to avoid ambigueties and misunderstandings, by defining the standard name of a term (e.g. epilepsy) and annotating all the possible synonyms (e.g. epilepsy syndrome)

In bioinformatics the most famous ontology is GO, but there are many others. For example MeSH annotates medical terms, and Disease Ontology (subset of MeSH) annotates disease names.

When an Ontology is provided with mappings between terms and genes, we can use it to calculate enrichment. For example, we can use Disease Ontology to see which diseases are enriched in our list:

In [30]:

library(DOSE)
myenrich.do = enrichDO(mygenes)
head(summary(myenrich.do))

	ID	Description	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	geneID	Count
DOID:0060116	DOID:0060116	sensory system cancer	11/18	139/8018	8.183720e-16	2.168686e-13	8.011431e-14	3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332	11
DOID:2174	DOID:2174	ocular cancer	11/18	139/8018	8.183720e-16	2.168686e-13	8.011431e-14	3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332	11
DOID:768	DOID:768	retinoblastoma	10/18	116/8018	1.076659e-14	1.360963e-12	5.027589e-13	1869/890/1871/9134/9928/1029/3479/6573/3757/332	10
DOID:771	DOID:771	retinal cell cancer	10/18	116/8018	1.076659e-14	1.360963e-12	5.027589e-13	1869/890/1871/9134/9928/1029/3479/6573/3757/332	10
DOID:4645	DOID:4645	retinal cancer	10/18	118/8018	1.283927e-14	1.360963e-12	5.027589e-13	1869/890/1871/9134/9928/1029/3479/6573/3757/332	10
DOID:5679	DOID:5679	retinal disease	12/18	455/8018	1.316791e-11	1.163165e-09	4.296896e-10	123/3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332	12

You can also plot the output of the enrichment.

NOTE: due to a bug in the current version of clusterProfiler, the following plot does not currently work. (will try to fix it next week)

In [32]:

plot(summary(myenrich.do))

Warning message in data.matrix(x):
“NAs introduced by coercion”Warning message in data.matrix(x):
“NAs introduced by coercion”Warning message in min(x):
“no non-missing arguments to min; returning Inf”Warning message in max(x):
“no non-missing arguments to max; returning -Inf”Warning message in min(x):
“no non-missing arguments to min; returning Inf”Warning message in max(x):
“no non-missing arguments to max; returning -Inf”

Error in plot.window(...): need finite 'xlim' values
Traceback:

1. plot(summary(myenrich.do))
2. plot(summary(myenrich.do))
3. plot.data.frame(summary(myenrich.do))
4. pairs(data.matrix(x), ...)
5. pairs.default(data.matrix(x), ...)
6. localPlot(x[, j], x[, i], xlab = "", ylab = "", axes = FALSE, 
 .     type = "n", ..., log = l)
7. plot(...)
8. plot.default(...)
9. localWindow(xlim, ylim, log, asp, ...)
10. plot.window(...)

Getting gene coordinates: the TxDB packages¶

Another package loaded with Homo.sapiens is a TxDb object, named TxDb.Hsapiens.UCSC.hg19.knownGene.

This contains coordinates of genes, transcripts and exons in the human genome.

We can get the gene coordinates using the genes() function:

In [33]:

genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand   |     gene_id
           <Rle>              <IRanges>  <Rle>   | <character>
      1    chr19 [ 58858172,  58874214]      -   |           1
     10     chr8 [ 18248755,  18258723]      +   |          10
    100    chr20 [ 43248163,  43280376]      -   |         100
   1000    chr18 [ 25530930,  25757445]      -   |        1000
  10000     chr1 [243651535, 244006886]      -   |       10000
    ...      ...                    ...    ... ...         ...
   9991     chr9 [114979995, 115095944]      -   |        9991
   9992    chr21 [ 35736323,  35743440]      +   |        9992
   9993    chr22 [ 19023795,  19109967]      -   |        9993
   9994     chr6 [ 90539619,  90584155]      +   |        9994
   9997    chr22 [ 50961997,  50964905]      -   |        9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Getting promoters, cds, transcripts and exons¶

Apart from genes, we can extract coordinates of transcripts(), exons(), cds(), and promoters(), using the homonymous function.

The output of these functions can be customized to include additional columns, e.g. you may want to print both the exon and gene id:

In [34]:

exons(TxDb.Hsapiens.UCSC.hg19.knownGene)

GRanges object with 289969 ranges and 1 metadata column:
           seqnames               ranges strand   |   exon_id
              <Rle>            <IRanges>  <Rle>   | <integer>
       [1]     chr1       [11874, 12227]      +   |         1
       [2]     chr1       [12595, 12721]      +   |         2
       [3]     chr1       [12613, 12721]      +   |         3
       [4]     chr1       [12646, 12697]      +   |         4
       [5]     chr1       [13221, 14409]      +   |         5
       ...      ...                  ...    ... ...       ...
  [289965]     chrY [27607404, 27607432]      -   |    277746
  [289966]     chrY [27635919, 27635954]      -   |    277747
  [289967]     chrY [59358329, 59359508]      -   |    277748
  [289968]     chrY [59360007, 59360115]      -   |    277749
  [289969]     chrY [59360501, 59360854]      -   |    277750
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

In [35]:

exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c("exon_id", "gene_id"))

GRanges object with 289969 ranges and 2 metadata columns:
           seqnames               ranges strand   |   exon_id         gene_id
              <Rle>            <IRanges>  <Rle>   | <integer> <CharacterList>
       [1]     chr1       [11874, 12227]      +   |         1       100287102
       [2]     chr1       [12595, 12721]      +   |         2       100287102
       [3]     chr1       [12613, 12721]      +   |         3       100287102
       [4]     chr1       [12646, 12697]      +   |         4       100287102
       [5]     chr1       [13221, 14409]      +   |         5       100287102
       ...      ...                  ...    ... ...       ...             ...
  [289965]     chrY [27607404, 27607432]      -   |    277746              NA
  [289966]     chrY [27635919, 27635954]      -   |    277747              NA
  [289967]     chrY [59358329, 59359508]      -   |    277748              NA
  [289968]     chrY [59360007, 59360115]      -   |    277749              NA
  [289969]     chrY [59360501, 59360854]      -   |    277750              NA
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

To get the list of all possible additional columns, we can use the columns() function as for the org.Hs.eg.db package:

In [36]:

columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

'CDSID'
'CDSNAME'
'CDSCHROM'
'CDSSTRAND'
'CDSSTART'
'CDSEND'
'EXONID'
'EXONNAME'
'EXONCHROM'
'EXONSTRAND'
'EXONSTART'
'EXONEND'
'GENEID'
'TXID'
'EXONRANK'
'TXNAME'
'TXTYPE'
'TXCHROM'
'TXSTRAND'
'TXSTART'
'TXEND'

Subsetting a GenomicRanges object¶

GenomicRanges object can be subsetted like normal data frames, using the subset function.

Chromosome names must be referred as "seqnames":

In [37]:

subset(exons(TxDb.Hsapiens.UCSC.hg19.knownGene), 
       seqnames=="chr6" & start>10000000 & end<20000000)

GRanges object with 569 ranges and 1 metadata column:
        seqnames               ranges strand   |   exon_id
           <Rle>            <IRanges>  <Rle>   | <integer>
    [1]     chr6 [10412551, 10412600]      +   |     86074
    [2]     chr6 [10414300, 10415284]      +   |     86075
    [3]     chr6 [10414914, 10415284]      +   |     86076
    [4]     chr6 [10415520, 10415613]      +   |     86077
    [5]     chr6 [10415520, 10416402]      +   |     86078
    ...      ...                  ...    ... ...       ...
  [565]     chr6 [19180383, 19180711]      -   |     93064
  [566]     chr6 [19642656, 19642760]      -   |     93065
  [567]     chr6 [19802395, 19804141]      -   |     93066
  [568]     chr6 [19804402, 19804508]      -   |     93067
  [569]     chr6 [19804906, 19804981]      -   |     93068
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Importing a BED file¶

BED and other formats (GFF, bigwig, etc..) can be imported using the import() function from rtracklayer.

Let's import a simple BED file:

In [38]:

library(rtracklayer)
marks = import("https://raw.githubusercontent.com/dalloliogm/belgrade_unix_intro/master/data/genes/example.bed",
      format="bed")
marks

GRanges object with 2 ranges and 2 metadata columns:
      seqnames                 ranges strand |        name     score
         <Rle>              <IRanges>  <Rle> | <character> <numeric>
  [1]     chr8 [145701383, 145702686]      + |       mark1        12
  [2]    chr20 [ 33850820,  33852880]      - |       mark2        52
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

If you have trouble downloading the file, just copy these contents and save it to a file called example.bed:

chr8	145701382	145702686	mark1	12	+
chr20	33850819	33852880	mark2	52	-

Operations on GenomicRanges: nearest¶

The BED file just imported contains coordinates of two regions.

We can use the nearest() function to find the genes that are closer to these regions:

In [39]:

human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
closest.genes = human.genes[nearest(marks, human.genes)]
closest.genes$mark = marks$name
closest.genes

GRanges object with 2 ranges and 2 metadata columns:
        seqnames                 ranges strand |     gene_id        mark
           <Rle>              <IRanges>  <Rle> | <character> <character>
  90990     chr8 [145691738, 145699499]      + |       90990       mark1
  55741    chr20 [ 33703160,  33865960]      - |       55741       mark2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Annotation Hub¶

AnnotationHub is a recently introduced library, which allows to download data from several sources, directly into R:

In [40]:

library(AnnotationHub)
ahub = AnnotationHub()
ahub

AnnotationHub with 35306 records
# snapshotDate(): 2016-08-15 
# $dataprovider: BroadInstitute, UCSC, Ensembl, NCBI, Haemcode, Inparanoid8,...
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
# $rdataclass: GRanges, FaFile, BigWigFile, OrgDb, ChainFile, Inparanoid8Db,...
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH2"]]' 

            title                                               
  AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
  AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
  AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
  AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
  AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
  ...       ...                                                 
  AH49436 | Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa
  AH49437 | Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.fa
  AH49438 | Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa   
  AH49439 | Xiphophorus_maculatus.Xipmac4.4.2.ncrna.fa          
  AH49440 | Xiphophorus_maculatus.Xipmac4.4.2.pep.all.fa

Getting data with AnnotationHub¶

For an overview of the data available via AnnotationHub, we can run the display function:

ahub = AnnotationHub()
display(ahub)

In alternative, we can use the subset function:

In [41]:

subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC")

AnnotationHub with 5958 records
# snapshotDate(): 2016-08-15 
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges, ChainFile, TwoBitFile
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH5012"]]' 

            title                                             
  AH5012  | Chromosome Band                                   
  AH5013  | STS Markers                                       
  AH5014  | FISH Clones                                       
  AH5015  | Recomb Rate                                       
  AH5016  | ENCODE Pilot                                      
  ...       ...                                               
  AH27618 | wgEncodeUwTfbsWerirb1CtcfStdPkRep2.narrowPeak.gz  
  AH27619 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep1.broadPeak.gz
  AH27620 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep2.broadPeak.gz
  AH27621 | wgEncodeUwTfbsWi38CtcfStdPkRep1.narrowPeak.gz     
  AH27622 | wgEncodeUwTfbsWi38CtcfStdPkRep2.narrowPeak.gz

For more complex queries, we can also use query():

In [42]:

query(subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC"), "Most cons")

AnnotationHub with 3 records
# snapshotDate(): 2016-08-15 
# $dataprovider: UCSC
# $species: Homo sapiens
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH5224"]]' 

           title           
  AH5224 | 28-Way Most Cons
  AH5225 | 17-Way Most Cons
  AH5354 | Most Conserved

Download a dataset from AnnotationHub¶

To download a dataset, we can use double square brackets, and the id of the dataset:

In [43]:

multi.24 = ahub[["AH5224"]]
multi.24

GRanges object with 2040420 ranges and 2 metadata columns:
                 seqnames               ranges strand   |        name     score
                    <Rle>            <IRanges>  <Rle>   | <character> <numeric>
        [1]          chr1 [25165820, 25165859]      *   |      lod=22       299
        [2]          chr1 [28311541, 28311557]      *   |      lod=14       244
        [3]          chr1 [29360118, 29360136]      *   |      lod=14       244
        [4]          chr1 [71303152, 71303427]      *   |     lod=462       674
        [5]          chr1 [88080377, 88080627]      *   |     lod=485       680
        ...           ...                  ...    ... ...         ...       ...
  [2040416] chr6_qbl_hap2   [4564430, 4564669]      *   |      lod=40       373
  [2040417] chr6_qbl_hap2   [4564824, 4564899]      *   |      lod=26       320
  [2040418] chr6_qbl_hap2   [4564975, 4565246]      *   |      lod=93       477
  [2040419] chr6_qbl_hap2   [4565271, 4565538]      *   |     lod=102       488
  [2040420] chr6_qbl_hap2   [4565616, 4565661]      *   |      lod=15       252
  -------
  seqinfo: 49 sequences from hg18 genome

Getting conservation scores for custom intervals¶

Now that we have conservation scores from the UCSC-Multiz table, we can get the scores for the marks bed file we imported earlier:

In [44]:

subsetByOverlaps(multi.24, marks)

GRanges object with 10 ranges and 2 metadata columns:
       seqnames                 ranges strand |        name     score
          <Rle>              <IRanges>  <Rle> | <character> <numeric>
   [1]     chr8 [145701434, 145701456]      * |      lod=19       281
   [2]     chr8 [145701503, 145701510]      * |      lod=13       235
   [3]     chr8 [145701542, 145701585]      * |      lod=35       357
   [4]     chr8 [145701653, 145701678]      * |      lod=24       310
   [5]     chr8 [145702037, 145702121]      * |      lod=36       360
   [6]     chr8 [145702184, 145702262]      * |      lod=65       433
   [7]     chr8 [145702272, 145702329]      * |      lod=36       360
   [8]     chr8 [145702421, 145702512]      * |      lod=46       390
   [9]    chr20 [ 33852458,  33852471]      * |      lod=15       252
  [10]    chr20 [ 33852712,  33852970]      * |     lod=428       665
  -------
  seqinfo: 49 sequences from hg18 genome

Another useful function is mergeByOverlaps, which combines information from both ranges:

In [45]:

mergeByOverlaps(marks, multi.24)

DataFrame with 10 rows and 6 columns
                            marks        name     score
                        <GRanges> <character> <numeric>
1   chr8:+:[145701383, 145702686]       mark1        12
2   chr8:+:[145701383, 145702686]       mark1        12
3   chr8:+:[145701383, 145702686]       mark1        12
4   chr8:+:[145701383, 145702686]       mark1        12
5   chr8:+:[145701383, 145702686]       mark1        12
6   chr8:+:[145701383, 145702686]       mark1        12
7   chr8:+:[145701383, 145702686]       mark1        12
8   chr8:+:[145701383, 145702686]       mark1        12
9  chr20:-:[ 33850820,  33852880]       mark2        52
10 chr20:-:[ 33850820,  33852880]       mark2        52
                         multi.24      name.1   score.1
                        <GRanges> <character> <numeric>
1   chr8:*:[145701434, 145701456]      lod=19       281
2   chr8:*:[145701503, 145701510]      lod=13       235
3   chr8:*:[145701542, 145701585]      lod=35       357
4   chr8:*:[145701653, 145701678]      lod=24       310
5   chr8:*:[145702037, 145702121]      lod=36       360
6   chr8:*:[145702184, 145702262]      lod=65       433
7   chr8:*:[145702272, 145702329]      lod=36       360
8   chr8:*:[145702421, 145702512]      lod=46       390
9  chr20:*:[ 33852458,  33852471]      lod=15       252
10 chr20:*:[ 33852712,  33852970]     lod=428       665

This can be used to calculate the average conservation by region. In this case we will use the dplyr library for the aggregation:

In [47]:

library(dplyr)
select = AnnotationDbi::select

mergeByOverlaps(marks, multi.24) %>% as.data.frame %>% group_by(marks.seqnames) %>% summarise(mean=mean(score.1))

marks.seqnames	mean
chr8	340.75
chr20	458.50