Notebook

InterMineR Workshop Use Case¶

We are going to re-create the workflow we did using the web interface using the R API.

The basic steps are:

Load the InterMine library and choose an InterMine to query.
Query 1: Diabetes Genes: Fetch a list of genes that are associated with diabetes
Query 2: PAX6 + Pancreas: Fetch a list of genes that have medium or high expression in the pancreas and are in our PAX6 targets list
Intersection: Find which genes are present in both Query 1 and Query2.
GWAS: Compare the intersection of the previous query with results from GWAS studies.

Getting started - Load InterMineR and choose an InterMine¶

Load the InterMine library. If it's not already installed, visit https://bioconductor.org/packages/release/bioc/html/InterMineR.html and follow the instructions to install.

In [ ]:

library(InterMineR)

We want to query human data - so let's look and see what InterMines are available.
The method we want to use is listMines()

In [ ]:

listMines()

Okay, let's select HumanMine from the list, and store it in a variable called humanMine

In [ ]:

Okay, now let's tell InterMineR that we want to use HumanMine for our queries. Store it in a variable called im.

Important: you'll need an API token for this part so you can access your HumanMine account. You can get your token by logging into HumanMine and going to the account details tab within MyMine. Cut and paste your token into the code below.

In [ ]:

# the syntax to initialise an InterMine is as follows:
# im <- initInterMine(mine=SOME-INTERMINE, "YOUR TOKEN HERE")
im <-

First Query: Diabetes Genes¶

Our first query will be to select all human genes that are associate with diabetes.

This will require two constraints:

Ensure all genes are human genes. Make sure Gene.organism.name is equal (use the = operator) to Homo sapiens. (HumanMine contains some non-human genes for homology query purposes)
Restrict results to genes that are associated with diabetes. This constraint will be asking for Gene.diseases.name to contain (use the operator CONTAINS) the string "diabetes"

We'll also select two columns to view:

Gene.primaryIdentifier
Gene.symbol

Store the first query in a variable valled query1Diabetes.

In [ ]:

# We'll be using the setQuery method here, with two arguments: select and where.
# 
# Syntax: 
# setQuery( 
#  select = c("VIEW.NAME1", "VIEW.NAME2", .... ),
#  where = setConstraints(
#    paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
#    operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
#    values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
#  )
#)


query1Diabetes <- setQuery( 
  # here we're choosing which columns of data we'd like to see
  select = c(  ),
    # set the logic for constraints. The first constraint is the first path+operator+value, 
    # e.g. Gene.organism.name = Homo sapiens, and the second constraint is the combination 
    # of the second path+operator+value, e.g. Gene.diseases.name CONTAINS diabetes
  where = setConstraints(
    paths = c(  ),
    operators = c(  ),
    values = list(  )
  )
)

Question to ponder: why did we use = for our Homo sapiens constraint, but CONTAINS for our diabetes constraint?

Anyway, we've set the query up, so now let's actually run it. Use the method runQuery and store the results in a variable called query1DiabetesResults

In [ ]:

# Syntax: runQuery(im,QUERY_TO_RUN)

query1DiabetesResults <- #put the code to run the query here

# and let's print out the first few results to make sure it looks like we'd expect:
# Syntax: head(THING_TO_PRINT)

Query 2: Pax6 targets that have high expression in the Pancreas¶

This time we're creating another query, but with slightly more complex constraints. We're looking for genes that are in the public HumanMine list PL_Pax6_Targets, that are also expressed in the pancreas at a High or Medium level.

We'll need a few more constraints than we did in Query 1:

all Genes should be IN the list PL_Pax6_Targets
Gene.proteinAtlasExpression.level should be set to High OR Medium. This will require two constraints, one for each of medium and high.
Gene.proteinAtlasExpression.tissue.name should be equal to Pancreas

We'd also like to see a few more columns this time:

The Gene's primaryIdentifier and symbol
The following expression data from Protein Atlas:
- Gene.proteinAtlasExpression.cellType
- Gene.proteinAtlasExpression.level
- Gene.proteinAtlasExpression.tissue.name

Store it in a variable called query2UpInPancreas

In [ ]:

# We don't want to see *all* genes and their expression. 
# Let's narrow it down a little by constraining it to genes that are 
# of interest, using setConstraints. 
# Syntax: 
# 
# setConstraints(
#    paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
#    operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
#    values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
# )

query2UpInPancreasConstraint <- setConstraints(

)

Excellent - we've defined the constraints we want. We still need to choose which columns to view.

In [ ]:

# Create a new query.
# Syntax: 
# newQuery(
#  view = c("VIEW.NAME.1", "VIEW.NAME.2"), 
#  constraintLogic = "X and (Y or Z)"      #constraintLogic is optional.
# )    
query2UpInPancreas = newQuery(
  # Choose which columns of data we'd like to see
  view = c(
      
  ),
  # set the logic for constraints. This means our pancreas expression level 
  # is EITHER Medium (B) or High (C), but not both.
  # --
  # Note: Constraint logic only needs to be set if you wish to use OR. All other
  # constraints have AND logic applied by default. 
  constraintLogic = # Add your logic here
)

# Add the constraint to our expressed pancreas query (previously we just _defined_ the constraint)
# Syntax: QUERYNAME$where <- CONSTRAINT

Remember, that was just setting up the query - we haven't run it yet

In [ ]:

# Now we have the query set up the way we want, let's actually *run* the query! 
# Syntax: runQuery(im,QUERY_TO_RUN)
query2UpInPancreasResults 

# Show me the first few results please! 
head(query2UpInPancreasResults)

Intersection: Which genes overlap in Query1 and Query2?¶

Let's check which genes are in BOTH lists that we've created. To do this, we'll strip down the columns we have to just primary identifiers, and then run a list intersect function.

Extract the "Gene.primaryIdentifier" from both of your result lists - query1DiabetesResults and query2UpInPancreasResults.
Run intersect(col1, col2) on the columns you extracted
Print the results and admire them!

In [ ]:

# Extract the primaryIdentifier columns from query1 (diabetes genes) and query 2 (upexpressed in pancreas)
# 
# Syntax to get an individual column view: 
# data.frame[["COLUMN.NAME"]]
primaryIdentifiersDiabetes <- # put your code here
primaryIdentifiersPancreas <- # put your code here

# Find the intersection of the two vectors of primary identifiers, and store it in a variable
#
# Syntax: intersect(VECTOR_1, VECTOR_2)

# Show the results using the print(VARIABLE_TO_PRINT) method.

GWAS: Compare the intersection above with results from GWAS studies¶

Finally, we fed the intersected list from above back into another query to see if there was any association of these genes with diabetes phenotypes according to GWAS studies. Note that we now start our query from the GWAS class.

Start by making a constraint with the following values:

"GWAS.results.pValue" is less than or equals (use the operator <=) to "1e-04". (1e-04 is the same as 0.0001).
"GWAS.results.phenotype" must contain (operator CONTAINS) the word diabetes.
"GWAS.results.associatedGenes.primaryIdentifier" should be equal to each of the values from your previous step. You'll need to repeat this constraint once per identifier.

In [ ]:

# First, we set up the constraints. The last three constraints are the 
# diabetesAndPancreas result genes from our last query.
#
# Syntax: 
# 
# setConstraints(
#    paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
#    operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
#    values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
# )
query3GWASConstraints <- setConstraints(
    # put your constraints here
  )

Now we've set our constraints up nicely, let's choose which columns we want to view.

In [ ]:

query3GWAS <- newQuery( 
  # Quite a few columns this time!
  view = c("GWAS.results.associatedGenes.primaryIdentifier",
    "GWAS.results.associatedGenes.symbol", "GWAS.results.associatedGenes.name",
    "GWAS.results.SNP.primaryIdentifier", "GWAS.results.pValue", "GWAS.results.phenotype",
    "GWAS.firstAuthor", "GWAS.name", "GWAS.publication.pubMedId",
    "GWAS.results.associatedGenes.organism.shortName"),
    # set the logic for constraints. Remember that we want our results
    # to include any one of the three genes we found in the list of diabetes+pancreas genes
    # so we need to use some OR logic.
  constraintLogic = #Fill the constraint logic in here. Remember the syntax: "A and (B or C)"
)

Add the constraints to the query, and then run it - we've filled this step in for you

In [ ]:

#add constraint
query3GWAS$where <- query3GWASConstraints
#run query
query3GWASResults <- runQuery(im, query3GWAS)

Now, let's view those results...

In [ ]:

query3GWASResults

And let's take a look at the unique gene symbols that were returned:

In [ ]:

GWASIds <- query3GWASResults["GWAS.results.associatedGenes.symbol"]
unique(GWASIds)