We are going to re-create the workflow we did using the web interface using the R API.
The basic steps are:
Load the InterMine library. If it's not already installed, visit https://bioconductor.org/packages/release/bioc/html/InterMineR.html and follow the instructions to install.
library(InterMineR)
We want to query human data - so let's look and see what InterMines are available:
listMines()
Okay, let's select HumanMine from the list:
humanMine <- listMines()["HumanMine"] #select humanmine
humanMine #print out the value to see what's inside
Okay, now let's tell InterMineR that we want to use HumanMine for our queries.
Important: you'll need an API token for this part so you can access your HumanMine account. You can get your token by logging into HumanMine and going to the account details tab within MyMine. Cut and paste your token into the code below.
im <- initInterMine(mine=humanMine, "YOUR TOKEN HERE")
Our first query will be to select all human genes that are associate with diabetes. This will require two constraints:
Home sapiens
genes (HumanMine contains some non-human genes for homology query purposes)diabetes
.query1Diabetes <- setQuery(
# here we're choosing which columns of data we'd like to see
select = c("Gene.primaryIdentifier", "Gene.symbol"),
# set the logic for constraints. The first constraint is the first path+operator+value,
# e.g. Gene.organism.name = Homo sapiens, and the second constraint is the combination
# of the second path+operator+value, e.g. Gene.diseases.name CONTAINS diabetes
where = setConstraints(
paths = c("Gene.organism.name", "Gene.diseases.name"),
operators = c("=", "CONTAINS"),
values = list("Homo sapiens","diabetes")
)
)
Question to ponder: why did we use =
for our Homo sapiens constraint, but CONTAINS
for our diabetes constraint?
Anyway, we've set the query up, so now let's actually run it:
query1DiabetesResults <- runQuery(im,query1Diabetes)
# and let's print out the first few results to make sure it looks like we'd expect:
head(query1DiabetesResults)
Gene.primaryIdentifier | Gene.symbol |
---|---|
<chr> | <chr> |
100188782 | NIDDM4 |
1056 | CEL |
10644 | IGF2BP2 |
11132 | CAPN10 |
1234 | CCR5 |
1493 | CTLA4 |
This time we're creating another query, but with slightly more complex constraints. We're looking for genes that are in the public HumanMine list PL_Pax6_Targets
, that are also expressed in the pancreas at a High
or Medium
level.
We'll need a few more constraints than we did in Query 1:
Gene
s should be in the list PL_Pax6_Targets
Gene.proteinAtlasExpression.tissue.name
should be equal to Pancreas
Gene.proteinAtlasExpression.level
should be set to High
OR Medium
. This will require two constraints, one for each of medium and high.We'd also like to see a few more columns this time:
Gene
's primaryIdentifier
and symbol
Gene.proteinAtlasExpression.cellType
Gene.proteinAtlasExpression.level
Gene.proteinAtlasExpression.tissue.name
# We don't want to see *all* genes and their expression.
# Let's narrow it down a little by constraining it to genes that are of interest
query2UpInPancreasConstraint = setConstraints(
paths = c("Gene",
"Gene.proteinAtlasExpression.level",
"Gene.proteinAtlasExpression.level",
"Gene.proteinAtlasExpression.tissue.name"),
operators = c("IN", rep("=", 3)),
# each constraint is automatically given a code, allowing us to manipulate the
# logic for the constraint.
# So for us, constraints are set to codes A, B, C, D in order,
# e.g. Code A: "Gene" should be "IN" the list named "PL_DiabetesGenes"
# Code B: "Gene.proteinAtlasExpression.level" should be equal to "Medium"
# Code C: "Gene.proteinAtlasExpression.level" should be equal to "High"
# Code D: "Gene.proteinAtlasExpression.tissue.name" should be equal to Pancreas"
#
# Now, you might be thinking "how can the expression level be equal to both Medium
# AND High?" The answer is - it can't, but take a quick look at the constraintLogic
# we will set in the next code cell for an explanation
values = list("PL_Pax6_Targets", "Medium", "High", "Pancreas")
)
Excellent - we've defined the constraints we want. We still need to choose which columns to view.
# Create a new query
query2UpInPancreas = newQuery(
# Choose which columns of data we'd like to see
view = c("Gene.primaryIdentifier",
"Gene.symbol",
"Gene.proteinAtlasExpression.cellType",
"Gene.proteinAtlasExpression.level",
"Gene.proteinAtlasExpression.tissue.name"
),
# set the logic for constraints. This means our pancreas expression level
# is EITHER Medium (B) or High (C), but not both.
# --
# Note: Constraint logic only needs to be set if you wish to use OR. All other
# constraints have AND logic applied by default.
constraintLogic = "A and (B or C) and D"
)
# Add the constraint to our expressed pancreas query (previously we just _defined_ the constraint)
query2UpInPancreas$where <- query2UpInPancreasConstraint
Remember, that was just setting up the query - we haven't run it yet
# Now we have the query set up the way we want, let's actually *run* the query!
query2UpInPancreasResults <- runQuery(im = im, qry = query2UpInPancreas)
# Show me the first few results please!
head(query2UpInPancreasResults)
Gene.primaryIdentifier | Gene.symbol | Gene.proteinAtlasExpression.cellType | Gene.proteinAtlasExpression.level | Gene.proteinAtlasExpression.tissue.name |
---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> |
10097 | ACTR2 | exocrine glandular cells | Medium | Pancreas |
10097 | ACTR2 | islets of Langerhans | Medium | Pancreas |
10196 | PRMT3 | exocrine glandular cells | Medium | Pancreas |
10196 | PRMT3 | islets of Langerhans | Medium | Pancreas |
1121 | CHM | exocrine glandular cells | Medium | Pancreas |
1121 | CHM | islets of Langerhans | Medium | Pancreas |
Let's check which genes are in BOTH lists that we've created. To do this, we'll strip down the columns we have to just primary identifiers, and then run a list intersect function.
# Extract the primaryIdentifier columns from query1 (diabetes genes) and query 2 (upexpressed in pancreas)
primaryIdentifiers.diabetes <- query1DiabetesResults[["Gene.primaryIdentifier"]]
primaryIdentifiers.pancreas <- query2UpInPancreasResults[["Gene.primaryIdentifier"]]
# Find the intersection of the two lists of primary identifiers
diabetesAndPancreasGenes <- intersect(primaryIdentifiers.diabetes,primaryIdentifiers.pancreas)
# Show the results
print(diabetesAndPancreasGenes)
[1] "3172" "6928" "6934"
Finally, we fed the intersected list from above back into another query to see if there was any association of these genes with diabetes phenotypes according to GWAS studies. Note that we now start our query from the GWAS
class:
# First, we set up the constraints. The last three constraints are the
# diabetesAndPancreas result genes from our last query.
query3GWASConstraints <- setConstraints(
paths = c("GWAS.results.pValue",
"GWAS.results.phenotype",
# using rep so we don't have to type this three times...
rep("GWAS.results.associatedGenes.primaryIdentifier",3)
),
operators = c("<=",
"CONTAINS",
rep("=",3)),
values = list("1e-04", #A
"diabetes",#B
"3172", #C
"6928", #D
"6934") #E
)
Now we've set our constraints up nicely, let's choose which columns we want to view.
query3GWAS <- newQuery(
# Quite a few columns this time!
view = c("GWAS.results.associatedGenes.primaryIdentifier",
"GWAS.results.associatedGenes.symbol", "GWAS.results.associatedGenes.name",
"GWAS.results.SNP.primaryIdentifier", "GWAS.results.pValue", "GWAS.results.phenotype",
"GWAS.firstAuthor", "GWAS.name", "GWAS.publication.pubMedId",
"GWAS.results.associatedGenes.organism.shortName"),
# set the logic for constraints. Remember that we want our results
# to include any one of the three genes we found in the list of diabetes+pancreas genes
# so we need to use some OR logic.
constraintLogic = "A and B and (C or D or E)"
)
Add the constraints to the query, and then run it...
#add constraint
query3GWAS$where <- query3GWASConstraints
#run query
query3GWASResults <- runQuery(im, query3GWAS)
Now, let's view those results...
query3GWASResults
GWAS.results.associatedGenes.primaryIdentifier | GWAS.results.associatedGenes.symbol | GWAS.results.associatedGenes.name | GWAS.results.SNP.primaryIdentifier | GWAS.results.pValue | GWAS.results.phenotype | GWAS.firstAuthor | GWAS.name | GWAS.publication.pubMedId | GWAS.results.associatedGenes.organism.shortName |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
3172 | HNF4A | hepatocyte nuclear factor 4 alpha | rs4812829 | 3.0E-10 | Type 2 diabetes | Kooner JS | Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. | 21874001 | H. sapiens |
3172 | HNF4A | hepatocyte nuclear factor 4 alpha | rs4812829 | 4.0E-10 | Type 2 diabetes | Zhao W | Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. | 28869590 | H. sapiens |
3172 | HNF4A | hepatocyte nuclear factor 4 alpha | rs4812829 | 5.0E-8 | Type 2 diabetes | Mahajan A | Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. | 24509480 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs117229942 | 4.0E-11 | Type 2 diabetes | Xue A | Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. | 30054458 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs34872471 | 1.0E-94 | Type 2 diabetes | Bonas-Guarch S | Re-analysis of public genetic data reveals a rare X-chromosomal variant associated with type 2 diabetes. | 29358691 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs34872471 | 6.0E-53 | Type 2 diabetes | Cook JP | Multi-ethnic genome-wide association study identifies novel locus for type 2 diabetes susceptibility. | 27189021 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs34872471 | 3.0E-23 | Type 2 diabetes | Imamura M | Genome-wide association studies in the Japanese population identify seven novel loci for type 2 diabetes. | 26818947 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs34872471 | 8.0E-8 | Type 2 diabetes | Ghassibe-Sabbagh M | T2DM GWAS in the Lebanese population confirms the role of TCF7L2 and CDKAL1 in disease susceptibility. | 25483131 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs4506565 | 5.0E-12 | Type 2 diabetes | Wellcome Trust Case Control Consortium | Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. | 17554300 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs4918796 | 4.0E-13 | Type 2 diabetes | Xue A | Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. | 30054458 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7901695 | 1.0E-48 | Type 2 diabetes | Zeggini E | Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. | 17463249 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7901695 | 1.0E-6 | Type 2 diabetes | Lettre G | Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. | 21347282 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 0.0 | Type 2 diabetes | Xue A | Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. | 30054458 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-219 | Type 2 diabetes | Zhao W | Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. | 28869590 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-184 | Type 2 diabetes | Zhao W | Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. | 28869590 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-139 | Type 2 diabetes | Morris AP | Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. | 22885922 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 4.0E-94 | Type 2 diabetes | Ng MC | Meta-analysis of genome-wide association studies in African Americans provides insights into the genetic architecture of type 2 diabetes. | 25102180 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 6.0E-92 | Type 2 diabetes | Morris AP | Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. | 22885922 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 8.0E-75 | Type 2 diabetes | Mahajan A | Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. | 24509480 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 9.0E-75 | Type 2 diabetes | Saxena R | Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. | 23300278 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 4.0E-73 | Type 2 diabetes | Morris AP | Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. | 22885922 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-51 | Type 2 diabetes | Voight BF | Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. | 20581827 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-48 | Type 2 diabetes | Saxena R | Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. | 17463246 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 7.0E-45 | Type 2 diabetes | Wood AR | Variants in the FTO and CDKAL1 loci have recessive effects on risk of obesity and type 2 diabetes, respectively. | 26961502 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 5.0E-44 | Type 2 diabetes | Ng MC | Meta-analysis of genome-wide association studies in African Americans provides insights into the genetic architecture of type 2 diabetes. | 25102180 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-40 | Type 2 diabetes | Perry JR | Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. | 22693455 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-38 | Type 2 diabetes | Saxena R | Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. | 23300278 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-35 | Type 2 diabetes | Tabassum R | Genome-wide association study for type 2 diabetes in Indians identifies a new susceptibility locus at 2q21. | 23209189 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 3.0E-35 | Type 2 diabetes | Saxena R | Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. | 23300278 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-34 | Type 2 diabetes | Sladek R | A genome-wide association study identifies novel risk loci for type 2 diabetes. | 17293876 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-30 | Type 2 diabetes and other traits | Rung J | Genetic variant near IRS1 is associated with type 2 diabetes, insulin resistance and hyperinsulinemia. | 19734900 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 9.0E-30 | Type 2 diabetes | Timpson NJ | Adiposity-related heterogeneity in patterns of type 2 diabetes susceptibility observed in genome-wide association data. | 19056611 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 3.0E-23 | Type 2 diabetes | Zeggini E | Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. | 18372903 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 3.0E-23 | Type 2 diabetes | Zhao W | Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. | 28869590 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 6.0E-22 | Type 2 diabetes | Saxena R | Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. | 23300278 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 4.0E-21 | Type 2 diabetes | Perry JR | Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases. | 22693455 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 3.0E-19 | Type 2 diabetes | Saxena R | Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. | 23300278 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 6.0E-16 | Type 2 diabetes | Timpson NJ | Adiposity-related heterogeneity in patterns of type 2 diabetes susceptibility observed in genome-wide association data. | 19056611 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-15 | Type 2 diabetes | Hara K | Genome-wide association study identifies three novel loci for type 2 diabetes. | 23945395 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-15 | Type 2 diabetes | Kho AN | Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. | 22101970 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-14 | Type 2 diabetes | Williams AL | Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. | 24390345 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 8.0E-12 | Type 2 diabetes | Takeuchi F | Confirmation of multiple risk Loci and genetic impacts by a genome-wide association study of type 2 diabetes in the Japanese population. | 19401414 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-11 | Type 2 diabetes | Qi Q | Genetics of Type 2 Diabetes in U.S. Hispanic/Latino Individuals: Results From the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). | 28254843 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 2.0E-10 | Type 2 diabetes | Steinthorsdottir V | A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. | 17460697 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 1.0E-8 | Type 2 diabetes | Scott LJ | A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. | 17463248 | H. sapiens |
6934 | TCF7L2 | transcription factor 7 like 2 | rs7903146 | 5.0E-8 | Type 2 diabetes | Salonen JT | Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium. | 17668382 | H. sapiens |
And let's take a look at the unique gene symbols that were returned:
GWASIds <- query3GWASResults["GWAS.results.associatedGenes.symbol"]
unique(GWASIds)
GWAS.results.associatedGenes.symbol | |
---|---|
<chr> | |
1 | HNF4A |
4 | TCF7L2 |