Giovanni M. Dall'Olio, GlaxoSmithKline. http://bioinfoblog.it
Welcome to the Bioconductor / Data Integration workshop.
This workshop is heavily inspired by the Coursera Bioconductor course. See here for materials: http://kasperdanielhansen.github.io/genbioconductor/
Course materials: http://dalloliogm.github.io
This workshop requires several bioconductor libraries, which take a while to install.
Please start their installation by copying&pasting the commands below. We'll continue the lecture while they get installed:
# bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Homo.sapiens")
biocLite("rtracklayer")
biocLite("DOSE")
biocLite("clusterProfiler")
biocLite("AnnotationHub")
# dplyr
install.packages(c("dplyr"))
Bioconductor is the biggest repository of bioinformatics libraries for R. It contains code for dealing with genomic coordinates, calculating differential expression, converting gene symbols and ids, and much more.
To be included in bioconductor, a library must:
First, run the following:
source("http://bioconductor.org/biocLite.R")
At this point you can run the biocLite function to install any package available at http://bioconductor.org/packages/release/BiocViews.html#___Software
For example:
biocLite("limma")
will install the library limma (a popular library for diff. expression analysis)
Tip: The next time you run R, you can use library(BiocInstaller) instead of source("http://bioconductor.org/biocLite.R")
Overview of a package:
library(help=Homo.sapiens)
ls("package:Homo.sapiens")
Vignettes
library(Homo.sapiens)
vignette()
Given a function from a Bioconductor package, get a list of all the objects objects to which it can be applied:
showMethods("findOverlaps")
Apart from analysis functions, Bioconductor contains several data packages (https://www.bioconductor.org/packages/release/data/annotation/), assembled from datasets of common interest for multiple organisms.
Some examples:
In addition two packages allow to access large dataset repositories:
In this tutorial we will see some of these (TxDB, org.eg.db, AnnotationHub).
Most data packages are annotated regularly, with different frequency depending on the data type, from 6 months to 2 years.
This is a good compromise between reproducibility of analysis, efforts needed to maintain the packages, and getting recent data.
If your species is not present or you need more recent data, you can create a package on your own.
The procedure is well documented and usually require just a table as input. It's an easy way to become a Bioconductor developer and good for CV :-)
Let's load the Homo.sapiens package. You will see that it will load several other packages:
library(Homo.sapiens)
library(help="Homo.sapiens")
Bioconductor contains similar wrapper packages for the most common model species (e.g. mouse, rat).
The org.*.eg.db packages allow to retrieve gene symbols and ids relative to a species (see list of all packages). The data is updated every two years, which is a good compromise between reproducibility and getting recent data.
If no org.eg.db package is available for your species, you may consider creating one yourself. You only require a table of genes/symbols, and following a tutorial
To see which data is included in the human package, we can open its help page:
library(help=org.Hs.eg.db)
In alternative, we can use the columns() function:
columns(org.Hs.eg.db)
This means that the org.Hs.eg.db package contains mapping between Entrez IDs, PFAM, Prosite, Genenames, GO, etc... for all human genes.
The BiMaps in a library can also be accessed as variable. For example, org.Hs.eg.db contains many variables starting with "org.Hs.eg":
ls("package:org.Hs.eg.db")
These can be converted to table using as.data.frame or toTable. For example the following gets a dataframe of all KEGG pathways to Entrez:
head(toTable(org.Hs.egPATH2EG))
gene_id | path_id |
---|---|
2 | 04610 |
9 | 00232 |
9 | 00983 |
9 | 01100 |
10 | 00232 |
10 | 00983 |
To get the name and description of these KEGG pathways, we would have to load the KEGG.db data package.
These packages also allows to get more documentation:
?org.Hs.egPATH2EG
The basic way to get data from these BiMap objects is the select() function from AnnotationDbi.
For example, here is how to get entrez and description of genes MGAT2 and MGAT3:
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("MGAT2", "MGAT3")), keytype='SYMBOL', columns=c('ENTREZID','GENENAME'))
SYMBOL | ENTREZID | GENENAME |
---|---|---|
MGAT2 | 4247 | mannosyl (alpha-1,6-)-glycoprotein beta-1,2-N-acetylglucosaminyltransferase |
MGAT3 | 4248 | mannosyl (beta-1,4-)-glycoprotein beta-1,4-N-acetylglucosaminyltransferase |
The "keys" argument defines the symbols that we want to search. The "keytype" and "columns" define the type of input and output. Use the functions keytypes(org.Hs.eg.db) and columns(org.Hs.eg.db) to see which values are supported.
What is the name and the Ensembl ID of the gene with Entrez ID 1234?
AnnotationDbi::select(org.Hs.eg.db, keys=as.character(c("1234")), keytype='ENTREZID', columns=c('SYMBOL','ENSEMBL'))
ENTREZID | SYMBOL | ENSEMBL |
---|---|---|
1234 | CCR5 | ENSG00000160791 |
The Gene Ontology database annotates terms related to the biological process, molecular function, and cellular compartment for every gene.
Can you get all the Gene Ontology (GO) terms associated to PTEN?
head(select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO'))
SYMBOL | GO | EVIDENCE | ONTOLOGY |
---|---|---|---|
PTEN | GO:0000079 | TAS | BP |
PTEN | GO:0000287 | IEA | MF |
PTEN | GO:0001525 | IEA | BP |
PTEN | GO:0001933 | IDA | BP |
PTEN | GO:0002902 | IEA | BP |
PTEN | GO:0004438 | IDA | MF |
To get the definition of these GO ids, we can use the GO.db database:
PTEN.go = select(org.Hs.eg.db, keys='PTEN', keytype='SYMBOL', columns='GO')
PTEN.go$TERM = AnnotationDbi::select(GO.db, keys=as.character(PTEN.go$GO), columns="TERM")$TERM
head(PTEN.go)
SYMBOL | GO | EVIDENCE | ONTOLOGY | TERM |
---|---|---|---|---|
PTEN | GO:0000079 | TAS | BP | regulation of cyclin-dependent protein serine/threonine kinase activity |
PTEN | GO:0000287 | IEA | MF | magnesium ion binding |
PTEN | GO:0001525 | IEA | BP | angiogenesis |
PTEN | GO:0001933 | IDA | BP | negative regulation of protein phosphorylation |
PTEN | GO:0002902 | IEA | BP | regulation of B cell apoptotic process |
PTEN | GO:0004438 | IDA | MF | phosphatidylinositol-3-phosphatase activity |
When we have a long list of genes, listing all the Gene Ontology terms associated to each gene produces a very verbose output, which is difficult to use.
It is more useful to do an Ontology Enrichment analysis, to see which are the most represented terms in the list.
There are several functions to do enrichment analysis in BiocConductor, but today we will use the enrichGO function from clusterProfiler:
library(clusterProfiler)
library(DOSE)
mygenes = c(23450, 5160, 7126, 27118, 8452, 3675, 6624, 3638, 9918, 123, 213, 3383,1869,890,1871,9134,9928,1029,3479,6573,3757,332)
print(mygenes[1:10])
[1] 23450 5160 7126 27118 8452 3675 6624 3638 9918 123
# Let's calculate the enrichment
myenrich = enrichGO(mygenes, "human", ont="MF", readable=F)
head(summary(myenrich))
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
GO:0019899 | GO:0019899 | enzyme binding | 9/20 | 1481/18585 | 9.460737e-06 | 0.001712393 | 0.001145247 | 1029/213/332/3675/3757/7126/8452/890/9928 | 9 |
GO:0005178 | GO:0005178 | integrin binding | 3/20 | 104/18585 | 1.810737e-04 | 0.009349881 | 0.006253192 | 3383/3479/3675 | 3 |
GO:0032403 | GO:0032403 | protein complex binding | 6/20 | 852/18585 | 2.032337e-04 | 0.009349881 | 0.006253192 | 332/3383/3479/3675/6624/9928 | 6 |
GO:0016538 | GO:0016538 | cyclin-dependent protein serine/threonine kinase regulator activity | 2/20 | 20/18585 | 2.066272e-04 | 0.009349881 | 0.006253192 | 1029/9134 | 2 |
GO:0019904 | GO:0019904 | protein domain specific binding | 5/20 | 588/18585 | 3.255709e-04 | 0.011785666 | 0.007882242 | 3675/3757/7126/8452/9928 | 5 |
GO:0050839 | GO:0050839 | cell adhesion molecule binding | 3/20 | 171/18585 | 7.775958e-04 | 0.023457474 | 0.015688337 | 3383/3479/3675 | 3 |
Behind the lines, enrichGO retrieves the GO annotations from the org.Hs.eg.db package, and calculates the enrichment.
The output can be plot in several ways. For example as a graph, using cnetplot:
cnetplot(myenrich)
We can also do an enrichment using other ontologies.
An ontology is a controlled dictionary for a category of terms. They allow to avoid ambigueties and misunderstandings, by defining the standard name of a term (e.g. epilepsy) and annotating all the possible synonyms (e.g. epilepsy syndrome)
In bioinformatics the most famous ontology is GO, but there are many others. For example MeSH annotates medical terms, and Disease Ontology (subset of MeSH) annotates disease names.
When an Ontology is provided with mappings between terms and genes, we can use it to calculate enrichment. For example, we can use Disease Ontology to see which diseases are enriched in our list:
library(DOSE)
myenrich.do = enrichDO(mygenes)
head(summary(myenrich.do))
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
DOID:0060116 | DOID:0060116 | sensory system cancer | 11/18 | 139/8018 | 8.183720e-16 | 2.168686e-13 | 8.011431e-14 | 3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 11 |
DOID:2174 | DOID:2174 | ocular cancer | 11/18 | 139/8018 | 8.183720e-16 | 2.168686e-13 | 8.011431e-14 | 3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 11 |
DOID:768 | DOID:768 | retinoblastoma | 10/18 | 116/8018 | 1.076659e-14 | 1.360963e-12 | 5.027589e-13 | 1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 10 |
DOID:771 | DOID:771 | retinal cell cancer | 10/18 | 116/8018 | 1.076659e-14 | 1.360963e-12 | 5.027589e-13 | 1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 10 |
DOID:4645 | DOID:4645 | retinal cancer | 10/18 | 118/8018 | 1.283927e-14 | 1.360963e-12 | 5.027589e-13 | 1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 10 |
DOID:5679 | DOID:5679 | retinal disease | 12/18 | 455/8018 | 1.316791e-11 | 1.163165e-09 | 4.296896e-10 | 123/3383/1869/890/1871/9134/9928/1029/3479/6573/3757/332 | 12 |
You can also plot the output of the enrichment.
NOTE: due to a bug in the current version of clusterProfiler, the following plot does not currently work. (will try to fix it next week)
plot(summary(myenrich.do))
Warning message in data.matrix(x): “NAs introduced by coercion”Warning message in data.matrix(x): “NAs introduced by coercion”Warning message in min(x): “no non-missing arguments to min; returning Inf”Warning message in max(x): “no non-missing arguments to max; returning -Inf”Warning message in min(x): “no non-missing arguments to min; returning Inf”Warning message in max(x): “no non-missing arguments to max; returning -Inf”
Error in plot.window(...): need finite 'xlim' values Traceback: 1. plot(summary(myenrich.do)) 2. plot(summary(myenrich.do)) 3. plot.data.frame(summary(myenrich.do)) 4. pairs(data.matrix(x), ...) 5. pairs.default(data.matrix(x), ...) 6. localPlot(x[, j], x[, i], xlab = "", ylab = "", axes = FALSE, . type = "n", ..., log = l) 7. plot(...) 8. plot.default(...) 9. localWindow(xlim, ylim, log, asp, ...) 10. plot.window(...)
Another package loaded with Homo.sapiens is a TxDb object, named TxDb.Hsapiens.UCSC.hg19.knownGene.
This contains coordinates of genes, transcripts and exons in the human genome.
We can get the gene coordinates using the genes() function:
genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
GRanges object with 23056 ranges and 1 metadata column: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> 1 chr19 [ 58858172, 58874214] - | 1 10 chr8 [ 18248755, 18258723] + | 10 100 chr20 [ 43248163, 43280376] - | 100 1000 chr18 [ 25530930, 25757445] - | 1000 10000 chr1 [243651535, 244006886] - | 10000 ... ... ... ... ... ... 9991 chr9 [114979995, 115095944] - | 9991 9992 chr21 [ 35736323, 35743440] + | 9992 9993 chr22 [ 19023795, 19109967] - | 9993 9994 chr6 [ 90539619, 90584155] + | 9994 9997 chr22 [ 50961997, 50964905] - | 9997 ------- seqinfo: 93 sequences (1 circular) from hg19 genome
Apart from genes, we can extract coordinates of transcripts(), exons(), cds(), and promoters(), using the homonymous function.
The output of these functions can be customized to include additional columns, e.g. you may want to print both the exon and gene id:
exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
GRanges object with 289969 ranges and 1 metadata column: seqnames ranges strand | exon_id <Rle> <IRanges> <Rle> | <integer> [1] chr1 [11874, 12227] + | 1 [2] chr1 [12595, 12721] + | 2 [3] chr1 [12613, 12721] + | 3 [4] chr1 [12646, 12697] + | 4 [5] chr1 [13221, 14409] + | 5 ... ... ... ... ... ... [289965] chrY [27607404, 27607432] - | 277746 [289966] chrY [27635919, 27635954] - | 277747 [289967] chrY [59358329, 59359508] - | 277748 [289968] chrY [59360007, 59360115] - | 277749 [289969] chrY [59360501, 59360854] - | 277750 ------- seqinfo: 93 sequences (1 circular) from hg19 genome
exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c("exon_id", "gene_id"))
GRanges object with 289969 ranges and 2 metadata columns: seqnames ranges strand | exon_id gene_id <Rle> <IRanges> <Rle> | <integer> <CharacterList> [1] chr1 [11874, 12227] + | 1 100287102 [2] chr1 [12595, 12721] + | 2 100287102 [3] chr1 [12613, 12721] + | 3 100287102 [4] chr1 [12646, 12697] + | 4 100287102 [5] chr1 [13221, 14409] + | 5 100287102 ... ... ... ... ... ... ... [289965] chrY [27607404, 27607432] - | 277746 NA [289966] chrY [27635919, 27635954] - | 277747 NA [289967] chrY [59358329, 59359508] - | 277748 NA [289968] chrY [59360007, 59360115] - | 277749 NA [289969] chrY [59360501, 59360854] - | 277750 NA ------- seqinfo: 93 sequences (1 circular) from hg19 genome
To get the list of all possible additional columns, we can use the columns() function as for the org.Hs.eg.db package:
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)
GenomicRanges object can be subsetted like normal data frames, using the subset function.
Chromosome names must be referred as "seqnames":
subset(exons(TxDb.Hsapiens.UCSC.hg19.knownGene),
seqnames=="chr6" & start>10000000 & end<20000000)
GRanges object with 569 ranges and 1 metadata column: seqnames ranges strand | exon_id <Rle> <IRanges> <Rle> | <integer> [1] chr6 [10412551, 10412600] + | 86074 [2] chr6 [10414300, 10415284] + | 86075 [3] chr6 [10414914, 10415284] + | 86076 [4] chr6 [10415520, 10415613] + | 86077 [5] chr6 [10415520, 10416402] + | 86078 ... ... ... ... ... ... [565] chr6 [19180383, 19180711] - | 93064 [566] chr6 [19642656, 19642760] - | 93065 [567] chr6 [19802395, 19804141] - | 93066 [568] chr6 [19804402, 19804508] - | 93067 [569] chr6 [19804906, 19804981] - | 93068 ------- seqinfo: 93 sequences (1 circular) from hg19 genome
BED and other formats (GFF, bigwig, etc..) can be imported using the import() function from rtracklayer.
Let's import a simple BED file:
library(rtracklayer)
marks = import("https://raw.githubusercontent.com/dalloliogm/belgrade_unix_intro/master/data/genes/example.bed",
format="bed")
marks
GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | name score <Rle> <IRanges> <Rle> | <character> <numeric> [1] chr8 [145701383, 145702686] + | mark1 12 [2] chr20 [ 33850820, 33852880] - | mark2 52 ------- seqinfo: 2 sequences from an unspecified genome; no seqlengths
If you have trouble downloading the file, just copy these contents and save it to a file called example.bed:
chr8 145701382 145702686 mark1 12 +
chr20 33850819 33852880 mark2 52 -
The BED file just imported contains coordinates of two regions.
We can use the nearest() function to find the genes that are closer to these regions:
human.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
closest.genes = human.genes[nearest(marks, human.genes)]
closest.genes$mark = marks$name
closest.genes
GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | gene_id mark <Rle> <IRanges> <Rle> | <character> <character> 90990 chr8 [145691738, 145699499] + | 90990 mark1 55741 chr20 [ 33703160, 33865960] - | 55741 mark2 ------- seqinfo: 93 sequences (1 circular) from hg19 genome
AnnotationHub is a recently introduced library, which allows to download data from several sources, directly into R:
library(AnnotationHub)
ahub = AnnotationHub()
ahub
AnnotationHub with 35306 records # snapshotDate(): 2016-08-15 # $dataprovider: BroadInstitute, UCSC, Ensembl, NCBI, Haemcode, Inparanoid8,... # $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r... # $rdataclass: GRanges, FaFile, BigWigFile, OrgDb, ChainFile, Inparanoid8Db,... # additional mcols(): taxonomyid, genome, description, tags, sourceurl, # sourcetype # retrieve records with, e.g., 'object[["AH2"]]' title AH2 | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa AH3 | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa AH4 | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa AH5 | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa AH6 | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa ... ... AH49436 | Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa AH49437 | Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.fa AH49438 | Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa AH49439 | Xiphophorus_maculatus.Xipmac4.4.2.ncrna.fa AH49440 | Xiphophorus_maculatus.Xipmac4.4.2.pep.all.fa
For an overview of the data available via AnnotationHub, we can run the display function:
ahub = AnnotationHub()
display(ahub)
In alternative, we can use the subset function:
subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC")
AnnotationHub with 5958 records # snapshotDate(): 2016-08-15 # $dataprovider: UCSC # $species: Homo sapiens # $rdataclass: GRanges, ChainFile, TwoBitFile # additional mcols(): taxonomyid, genome, description, tags, sourceurl, # sourcetype # retrieve records with, e.g., 'object[["AH5012"]]' title AH5012 | Chromosome Band AH5013 | STS Markers AH5014 | FISH Clones AH5015 | Recomb Rate AH5016 | ENCODE Pilot ... ... AH27618 | wgEncodeUwTfbsWerirb1CtcfStdPkRep2.narrowPeak.gz AH27619 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep1.broadPeak.gz AH27620 | wgEncodeUwTfbsWi38CtcfStdHotspotsRep2.broadPeak.gz AH27621 | wgEncodeUwTfbsWi38CtcfStdPkRep1.narrowPeak.gz AH27622 | wgEncodeUwTfbsWi38CtcfStdPkRep2.narrowPeak.gz
For more complex queries, we can also use query():
query(subset(ahub, species == "Homo sapiens" & dataprovider == "UCSC"), "Most cons")
AnnotationHub with 3 records # snapshotDate(): 2016-08-15 # $dataprovider: UCSC # $species: Homo sapiens # $rdataclass: GRanges # additional mcols(): taxonomyid, genome, description, tags, sourceurl, # sourcetype # retrieve records with, e.g., 'object[["AH5224"]]' title AH5224 | 28-Way Most Cons AH5225 | 17-Way Most Cons AH5354 | Most Conserved
To download a dataset, we can use double square brackets, and the id of the dataset:
multi.24 = ahub[["AH5224"]]
multi.24
GRanges object with 2040420 ranges and 2 metadata columns: seqnames ranges strand | name score <Rle> <IRanges> <Rle> | <character> <numeric> [1] chr1 [25165820, 25165859] * | lod=22 299 [2] chr1 [28311541, 28311557] * | lod=14 244 [3] chr1 [29360118, 29360136] * | lod=14 244 [4] chr1 [71303152, 71303427] * | lod=462 674 [5] chr1 [88080377, 88080627] * | lod=485 680 ... ... ... ... ... ... ... [2040416] chr6_qbl_hap2 [4564430, 4564669] * | lod=40 373 [2040417] chr6_qbl_hap2 [4564824, 4564899] * | lod=26 320 [2040418] chr6_qbl_hap2 [4564975, 4565246] * | lod=93 477 [2040419] chr6_qbl_hap2 [4565271, 4565538] * | lod=102 488 [2040420] chr6_qbl_hap2 [4565616, 4565661] * | lod=15 252 ------- seqinfo: 49 sequences from hg18 genome
Now that we have conservation scores from the UCSC-Multiz table, we can get the scores for the marks bed file we imported earlier:
subsetByOverlaps(multi.24, marks)
GRanges object with 10 ranges and 2 metadata columns: seqnames ranges strand | name score <Rle> <IRanges> <Rle> | <character> <numeric> [1] chr8 [145701434, 145701456] * | lod=19 281 [2] chr8 [145701503, 145701510] * | lod=13 235 [3] chr8 [145701542, 145701585] * | lod=35 357 [4] chr8 [145701653, 145701678] * | lod=24 310 [5] chr8 [145702037, 145702121] * | lod=36 360 [6] chr8 [145702184, 145702262] * | lod=65 433 [7] chr8 [145702272, 145702329] * | lod=36 360 [8] chr8 [145702421, 145702512] * | lod=46 390 [9] chr20 [ 33852458, 33852471] * | lod=15 252 [10] chr20 [ 33852712, 33852970] * | lod=428 665 ------- seqinfo: 49 sequences from hg18 genome
Another useful function is mergeByOverlaps, which combines information from both ranges:
mergeByOverlaps(marks, multi.24)
DataFrame with 10 rows and 6 columns marks name score <GRanges> <character> <numeric> 1 chr8:+:[145701383, 145702686] mark1 12 2 chr8:+:[145701383, 145702686] mark1 12 3 chr8:+:[145701383, 145702686] mark1 12 4 chr8:+:[145701383, 145702686] mark1 12 5 chr8:+:[145701383, 145702686] mark1 12 6 chr8:+:[145701383, 145702686] mark1 12 7 chr8:+:[145701383, 145702686] mark1 12 8 chr8:+:[145701383, 145702686] mark1 12 9 chr20:-:[ 33850820, 33852880] mark2 52 10 chr20:-:[ 33850820, 33852880] mark2 52 multi.24 name.1 score.1 <GRanges> <character> <numeric> 1 chr8:*:[145701434, 145701456] lod=19 281 2 chr8:*:[145701503, 145701510] lod=13 235 3 chr8:*:[145701542, 145701585] lod=35 357 4 chr8:*:[145701653, 145701678] lod=24 310 5 chr8:*:[145702037, 145702121] lod=36 360 6 chr8:*:[145702184, 145702262] lod=65 433 7 chr8:*:[145702272, 145702329] lod=36 360 8 chr8:*:[145702421, 145702512] lod=46 390 9 chr20:*:[ 33852458, 33852471] lod=15 252 10 chr20:*:[ 33852712, 33852970] lod=428 665
This can be used to calculate the average conservation by region. In this case we will use the dplyr library for the aggregation:
library(dplyr)
select = AnnotationDbi::select
mergeByOverlaps(marks, multi.24) %>% as.data.frame %>% group_by(marks.seqnames) %>% summarise(mean=mean(score.1))
marks.seqnames | mean |
---|---|
chr8 | 340.75 |
chr20 | 458.50 |