InterMineR Workshop Use Case

We are going to re-create the workflow we did using the web interface using the R API.

The basic steps are:

  1. Load the InterMine library and choose an InterMine to query.
  2. Query 1: Diabetes Genes: Fetch a list of genes that are associated with diabetes
  3. Query 2: PAX6 + Pancreas: Fetch a list of genes that have medium or high expression in the pancreas and are in our PAX6 targets list
  4. Intersection: Find which genes are present in both Query 1 and Query2.
  5. GWAS: Compare the intersection of the previous query with results from GWAS studies.

Getting started - Load InterMineR and choose an InterMine

Load the InterMine library. If it's not already installed, visit https://bioconductor.org/packages/release/bioc/html/InterMineR.html and follow the instructions to install.

In [1]:
library(InterMineR)

We want to query human data - so let's look and see what InterMines are available:

In [2]:
listMines()
BMAP
'https://bmap.jgi.doe.gov/bmapmine/'
BeanMine
'https://mines.legumeinfo.org/beanmine'
BovineMine
'http://genomes.missouri.edu/bovinemine'
CHOmine
'https://chomine.boku.ac.at/chomine'
ChickpeaMine
'https://mines.legumeinfo.org/chickpeamine'
CowpeaMine
'https://mines.legumeinfo.org/cowpeamine'
FlyMine
'http://www.flymine.org/flymine'
GrapeMine
'http://urgi.versailles.inra.fr/GrapeMine'
HumanMine
'http://www.humanmine.org/humanmine'
HymenopteraMine
'http://hymenopteragenome.org/hymenopteramine'
IndigoMine
'http://www.cbrc.kaust.edu.sa/indigo'
LegumeMine
'https://mines.legumeinfo.org/legumemine'
MaizeMine
'http://maizemine.rnet.missouri.edu:8080/maizemine'
MedicMine
'http://medicmine.jcvi.org/medicmine'
MitoMiner
'http://mitominer.mrc-mbu.cam.ac.uk/release-4.0'
ModMine
'http://intermine.modencode.org/release-33'
MouseMine
'http://www.mousemine.org/mousemine'
OakMine
'https://urgi.versailles.inra.fr/OakMine_PM1N'
PeanutMine
'https://mines.legumeinfo.org/peanutmine'
PhytoMine
'https://phytozome.jgi.doe.gov/phytomine/'
PlanMine
'http://planmine.mpi-cbg.de/planmine'
RatMine
'http://ratmine.mcw.edu/ratmine'
RepetDB
'http://urgi.versailles.inra.fr/repetdb'
SoyMine
'https://mines.legumeinfo.org/soymine'
TargetMine
'https://targetmine.mizuguchilab.org/targetmine'
ThaleMine
'https://apps.araport.org/thalemine'
WheatMine
'https://urgi.versailles.inra.fr/WheatMine'
WormMine
'http://intermine.wormbase.org/tools/wormmine/'
XenMine
'http://www.xenmine.org/xenmine'
YeastMine
'https://yeastmine.yeastgenome.org/yeastmine'
ZebrafishMine
'http://zebrafishmine.org'

Okay, let's select HumanMine from the list:

In [3]:
humanMine <- listMines()["HumanMine"] #select humanmine
humanMine #print out the value to see what's inside
HumanMine: 'http://www.humanmine.org/humanmine'

Okay, now let's tell InterMineR that we want to use HumanMine for our queries.

Important: you'll need an API token for this part so you can access your HumanMine account. You can get your token by logging into HumanMine and going to the account details tab within MyMine. Cut and paste your token into the code below.

In [4]:
im <- initInterMine(mine=humanMine, "YOUR TOKEN HERE")

First Query: Diabetes Genes

Our first query will be to select all human genes that are associate with diabetes. This will require two constraints:

  1. Ensure all genes returned are Home sapiens genes (HumanMine contains some non-human genes for homology query purposes)
  2. Restrict results to genes that are associated with diabetes.
In [18]:
query1Diabetes <- setQuery( 
  # here we're choosing which columns of data we'd like to see
  select = c("Gene.primaryIdentifier", "Gene.symbol"),
    # set the logic for constraints. The first constraint is the first path+operator+value, 
    # e.g. Gene.organism.name = Homo sapiens, and the second constraint is the combination 
    # of the second path+operator+value, e.g. Gene.diseases.name CONTAINS diabetes
  where = setConstraints(
    paths = c("Gene.organism.name", "Gene.diseases.name"),
    operators = c("=", "CONTAINS"),
    values = list("Homo sapiens","diabetes")
  )
)

Question to ponder: why did we use = for our Homo sapiens constraint, but CONTAINS for our diabetes constraint?

Anyway, we've set the query up, so now let's actually run it:

In [19]:
query1DiabetesResults <- runQuery(im,query1Diabetes)

# and let's print out the first few results to make sure it looks like we'd expect:
head(query1DiabetesResults)
A data.frame: 6 × 2
Gene.primaryIdentifierGene.symbol
<chr><chr>
100188782NIDDM4
1056 CEL
10644 IGF2BP2
11132 CAPN10
1234 CCR5
1493 CTLA4

Query 2: Pax6 targets that have high expression in the Pancreas

This time we're creating another query, but with slightly more complex constraints. We're looking for genes that are in the public HumanMine list PL_Pax6_Targets, that are also expressed in the pancreas at a High or Medium level.

We'll need a few more constraints than we did in Query 1:

  1. all Genes should be in the list PL_Pax6_Targets
  2. Gene.proteinAtlasExpression.tissue.name should be equal to Pancreas
  3. Gene.proteinAtlasExpression.level should be set to High OR Medium. This will require two constraints, one for each of medium and high.

We'd also like to see a few more columns this time:

  1. The Gene's primaryIdentifier and symbol
  2. The following expression data from Protein Atlas:
    • Gene.proteinAtlasExpression.cellType
    • Gene.proteinAtlasExpression.level
    • Gene.proteinAtlasExpression.tissue.name
In [20]:
# We don't want to see *all* genes and their expression. 
# Let's narrow it down a little by constraining it to genes that are of interest
query2UpInPancreasConstraint = setConstraints(
  paths = c("Gene", 
            "Gene.proteinAtlasExpression.level", 
            "Gene.proteinAtlasExpression.level", 
            "Gene.proteinAtlasExpression.tissue.name"),
  operators = c("IN", rep("=", 3)),
  # each constraint is automatically given a code, allowing us to manipulate the 
  # logic for the constraint. 
  #  So for us, constraints are set to codes A, B, C, D in order, 
  #  e.g. Code A: "Gene" should be "IN" the list named "PL_DiabetesGenes"
  #       Code B: "Gene.proteinAtlasExpression.level" should be equal to "Medium"
  #       Code C: "Gene.proteinAtlasExpression.level" should be equal to "High"
  #       Code D: "Gene.proteinAtlasExpression.tissue.name" should be equal to Pancreas"
  # 
  # Now, you might be thinking "how can the expression level be equal to both Medium
  # AND High?" The answer is - it can't, but take a quick look at the constraintLogic
  # we will set in the next code cell for an explanation
  values = list("PL_Pax6_Targets", "Medium", "High", "Pancreas")
)

Excellent - we've defined the constraints we want. We still need to choose which columns to view.

In [21]:
# Create a new query
query2UpInPancreas = newQuery(
  # Choose which columns of data we'd like to see
  view = c("Gene.primaryIdentifier",
             "Gene.symbol",
             "Gene.proteinAtlasExpression.cellType",
             "Gene.proteinAtlasExpression.level",
             "Gene.proteinAtlasExpression.tissue.name"
  ),
  # set the logic for constraints. This means our pancreas expression level 
  # is EITHER Medium (B) or High (C), but not both.
  # --
  # Note: Constraint logic only needs to be set if you wish to use OR. All other
  # constraints have AND logic applied by default. 
  constraintLogic = "A and (B or C) and D"
)

# Add the constraint to our expressed pancreas query (previously we just _defined_ the constraint)
query2UpInPancreas$where <- query2UpInPancreasConstraint

Remember, that was just setting up the query - we haven't run it yet

In [22]:
# Now we have the query set up the way we want, let's actually *run* the query! 
query2UpInPancreasResults <-  runQuery(im = im, qry = query2UpInPancreas)

# Show me the first few results please! 
head(query2UpInPancreasResults)
A data.frame: 6 × 5
Gene.primaryIdentifierGene.symbolGene.proteinAtlasExpression.cellTypeGene.proteinAtlasExpression.levelGene.proteinAtlasExpression.tissue.name
<chr><chr><chr><chr><chr>
10097ACTR2exocrine glandular cellsMediumPancreas
10097ACTR2islets of Langerhans MediumPancreas
10196PRMT3exocrine glandular cellsMediumPancreas
10196PRMT3islets of Langerhans MediumPancreas
1121 CHM exocrine glandular cellsMediumPancreas
1121 CHM islets of Langerhans MediumPancreas

Intersection: Which genes overlap in Query1 and Query2?

Let's check which genes are in BOTH lists that we've created. To do this, we'll strip down the columns we have to just primary identifiers, and then run a list intersect function.

In [23]:
# Extract the primaryIdentifier columns from query1 (diabetes genes) and query 2 (upexpressed in pancreas)
primaryIdentifiers.diabetes <- query1DiabetesResults[["Gene.primaryIdentifier"]]
primaryIdentifiers.pancreas <- query2UpInPancreasResults[["Gene.primaryIdentifier"]]

# Find the intersection of the two lists of primary identifiers
diabetesAndPancreasGenes <- intersect(primaryIdentifiers.diabetes,primaryIdentifiers.pancreas)

# Show the results
print(diabetesAndPancreasGenes)
[1] "3172" "6928" "6934"

GWAS: Compare the intersection above with results from GWAS studies

Finally, we fed the intersected list from above back into another query to see if there was any association of these genes with diabetes phenotypes according to GWAS studies. Note that we now start our query from the GWAS class:

In [24]:
# First, we set up the constraints. The last three constraints are the 
# diabetesAndPancreas result genes from our last query.
query3GWASConstraints <- setConstraints(
    paths = c("GWAS.results.pValue", 
              "GWAS.results.phenotype",
              # using rep so we don't have to type this three times... 
              rep("GWAS.results.associatedGenes.primaryIdentifier",3)
             ),
    operators = c("<=", 
                  "CONTAINS",
                  rep("=",3)),
    values = list("1e-04",   #A
                  "diabetes",#B 
                  "3172",    #C
                  "6928",    #D
                  "6934")    #E
  )

Now we've set our constraints up nicely, let's choose which columns we want to view.

In [25]:
query3GWAS <- newQuery( 
  # Quite a few columns this time!
  view = c("GWAS.results.associatedGenes.primaryIdentifier",
    "GWAS.results.associatedGenes.symbol", "GWAS.results.associatedGenes.name",
    "GWAS.results.SNP.primaryIdentifier", "GWAS.results.pValue", "GWAS.results.phenotype",
    "GWAS.firstAuthor", "GWAS.name", "GWAS.publication.pubMedId",
    "GWAS.results.associatedGenes.organism.shortName"),
    # set the logic for constraints. Remember that we want our results
    # to include any one of the three genes we found in the list of diabetes+pancreas genes
    # so we need to use some OR logic.
  constraintLogic = "A and B and (C or D or E)"
)

Add the constraints to the query, and then run it...

In [27]:
#add constraint
query3GWAS$where <- query3GWASConstraints
#run query
query3GWASResults <- runQuery(im, query3GWAS)

Now, let's view those results...

In [30]:
query3GWASResults
A data.frame: 46 × 10
GWAS.results.associatedGenes.primaryIdentifierGWAS.results.associatedGenes.symbolGWAS.results.associatedGenes.nameGWAS.results.SNP.primaryIdentifierGWAS.results.pValueGWAS.results.phenotypeGWAS.firstAuthorGWAS.nameGWAS.publication.pubMedIdGWAS.results.associatedGenes.organism.shortName
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
3172HNF4A hepatocyte nuclear factor 4 alphars4812829 3.0E-10 Type 2 diabetes Kooner JS Genome-wide association study in individuals of South Asian ancestry identifies six new type 2 diabetes susceptibility loci. 21874001H. sapiens
3172HNF4A hepatocyte nuclear factor 4 alphars4812829 4.0E-10 Type 2 diabetes Zhao W Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. 28869590H. sapiens
3172HNF4A hepatocyte nuclear factor 4 alphars4812829 5.0E-8 Type 2 diabetes Mahajan A Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. 24509480H. sapiens
6934TCF7L2transcription factor 7 like 2 rs1172299424.0E-11 Type 2 diabetes Xue A Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. 30054458H. sapiens
6934TCF7L2transcription factor 7 like 2 rs34872471 1.0E-94 Type 2 diabetes Bonas-Guarch S Re-analysis of public genetic data reveals a rare X-chromosomal variant associated with type 2 diabetes. 29358691H. sapiens
6934TCF7L2transcription factor 7 like 2 rs34872471 6.0E-53 Type 2 diabetes Cook JP Multi-ethnic genome-wide association study identifies novel locus for type 2 diabetes susceptibility. 27189021H. sapiens
6934TCF7L2transcription factor 7 like 2 rs34872471 3.0E-23 Type 2 diabetes Imamura M Genome-wide association studies in the Japanese population identify seven novel loci for type 2 diabetes. 26818947H. sapiens
6934TCF7L2transcription factor 7 like 2 rs34872471 8.0E-8 Type 2 diabetes Ghassibe-Sabbagh M T2DM GWAS in the Lebanese population confirms the role of TCF7L2 and CDKAL1 in disease susceptibility. 25483131H. sapiens
6934TCF7L2transcription factor 7 like 2 rs4506565 5.0E-12 Type 2 diabetes Wellcome Trust Case Control ConsortiumGenome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. 17554300H. sapiens
6934TCF7L2transcription factor 7 like 2 rs4918796 4.0E-13 Type 2 diabetes Xue A Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. 30054458H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7901695 1.0E-48 Type 2 diabetes Zeggini E Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. 17463249H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7901695 1.0E-6 Type 2 diabetes Lettre G Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. 21347282H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 0.0 Type 2 diabetes Xue A Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. 30054458H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-219Type 2 diabetes Zhao W Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. 28869590H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-184Type 2 diabetes Zhao W Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. 28869590H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-139Type 2 diabetes Morris AP Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. 22885922H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 4.0E-94 Type 2 diabetes Ng MC Meta-analysis of genome-wide association studies in African Americans provides insights into the genetic architecture of type 2 diabetes. 25102180H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 6.0E-92 Type 2 diabetes Morris AP Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. 22885922H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 8.0E-75 Type 2 diabetes Mahajan A Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. 24509480H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 9.0E-75 Type 2 diabetes Saxena R Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. 23300278H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 4.0E-73 Type 2 diabetes Morris AP Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. 22885922H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-51 Type 2 diabetes Voight BF Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. 20581827H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-48 Type 2 diabetes Saxena R Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. 17463246H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 7.0E-45 Type 2 diabetes Wood AR Variants in the FTO and CDKAL1 loci have recessive effects on risk of obesity and type 2 diabetes, respectively. 26961502H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 5.0E-44 Type 2 diabetes Ng MC Meta-analysis of genome-wide association studies in African Americans provides insights into the genetic architecture of type 2 diabetes. 25102180H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-40 Type 2 diabetes Perry JR Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases.22693455H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-38 Type 2 diabetes Saxena R Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. 23300278H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-35 Type 2 diabetes Tabassum R Genome-wide association study for type 2 diabetes in Indians identifies a new susceptibility locus at 2q21. 23209189H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 3.0E-35 Type 2 diabetes Saxena R Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. 23300278H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-34 Type 2 diabetes Sladek R A genome-wide association study identifies novel risk loci for type 2 diabetes. 17293876H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-30 Type 2 diabetes and other traitsRung J Genetic variant near IRS1 is associated with type 2 diabetes, insulin resistance and hyperinsulinemia. 19734900H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 9.0E-30 Type 2 diabetes Timpson NJ Adiposity-related heterogeneity in patterns of type 2 diabetes susceptibility observed in genome-wide association data. 19056611H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 3.0E-23 Type 2 diabetes Zeggini E Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. 18372903H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 3.0E-23 Type 2 diabetes Zhao W Identification of new susceptibility loci for type 2 diabetes and shared etiological pathways with coronary heart disease. 28869590H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 6.0E-22 Type 2 diabetes Saxena R Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. 23300278H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 4.0E-21 Type 2 diabetes Perry JR Stratifying type 2 diabetes cases by BMI identifies genetic risk variants in LAMA1 and enrichment for risk variants in lean compared to obese cases.22693455H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 3.0E-19 Type 2 diabetes Saxena R Genome-wide association study identifies a novel locus contributing to type 2 diabetes susceptibility in Sikhs of Punjabi origin from India. 23300278H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 6.0E-16 Type 2 diabetes Timpson NJ Adiposity-related heterogeneity in patterns of type 2 diabetes susceptibility observed in genome-wide association data. 19056611H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-15 Type 2 diabetes Hara K Genome-wide association study identifies three novel loci for type 2 diabetes. 23945395H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-15 Type 2 diabetes Kho AN Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. 22101970H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-14 Type 2 diabetes Williams AL Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. 24390345H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 8.0E-12 Type 2 diabetes Takeuchi F Confirmation of multiple risk Loci and genetic impacts by a genome-wide association study of type 2 diabetes in the Japanese population. 19401414H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-11 Type 2 diabetes Qi Q Genetics of Type 2 Diabetes in U.S. Hispanic/Latino Individuals: Results From the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). 28254843H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 2.0E-10 Type 2 diabetes Steinthorsdottir V A variant in CDKAL1 influences insulin response and risk of type 2 diabetes. 17460697H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 1.0E-8 Type 2 diabetes Scott LJ A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. 17463248H. sapiens
6934TCF7L2transcription factor 7 like 2 rs7903146 5.0E-8 Type 2 diabetes Salonen JT Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium. 17668382H. sapiens

And let's take a look at the unique gene symbols that were returned:

In [33]:
GWASIds <- query3GWASResults["GWAS.results.associatedGenes.symbol"]
unique(GWASIds)
A data.frame: 2 × 1
GWAS.results.associatedGenes.symbol
<chr>
1HNF4A
4TCF7L2