We are going to re-create the workflow we did using the web interface using the R API.
The basic steps are:
Load the InterMine library. If it's not already installed, visit https://bioconductor.org/packages/release/bioc/html/InterMineR.html and follow the instructions to install.
library(InterMineR)
We want to query human data - so let's look and see what InterMines are available.
The method we want to use is listMines()
listMines()
Okay, let's select HumanMine from the list, and store it in a variable called humanMine
Okay, now let's tell InterMineR that we want to use HumanMine for our queries. Store it in a variable called im
.
Important: you'll need an API token for this part so you can access your HumanMine account. You can get your token by logging into HumanMine and going to the account details tab within MyMine. Cut and paste your token into the code below.
# the syntax to initialise an InterMine is as follows:
# im <- initInterMine(mine=SOME-INTERMINE, "YOUR TOKEN HERE")
im <-
Our first query will be to select all human genes that are associate with diabetes.
This will require two constraints:
Gene.organism.name
is equal (use the =
operator) to Homo sapiens
. (HumanMine contains some non-human genes for homology query purposes)diabetes
. This constraint will be asking for Gene.diseases.name
to contain (use the operator CONTAINS
) the string "diabetes"
We'll also select two columns to view:
Gene.primaryIdentifier
Gene.symbol
Store the first query in a variable valled query1Diabetes
.
# We'll be using the setQuery method here, with two arguments: select and where.
#
# Syntax:
# setQuery(
# select = c("VIEW.NAME1", "VIEW.NAME2", .... ),
# where = setConstraints(
# paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
# operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
# values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
# )
#)
query1Diabetes <- setQuery(
# here we're choosing which columns of data we'd like to see
select = c( ),
# set the logic for constraints. The first constraint is the first path+operator+value,
# e.g. Gene.organism.name = Homo sapiens, and the second constraint is the combination
# of the second path+operator+value, e.g. Gene.diseases.name CONTAINS diabetes
where = setConstraints(
paths = c( ),
operators = c( ),
values = list( )
)
)
Question to ponder: why did we use =
for our Homo sapiens constraint, but CONTAINS
for our diabetes constraint?
Anyway, we've set the query up, so now let's actually run it. Use the method runQuery
and store the results in a variable called query1DiabetesResults
# Syntax: runQuery(im,QUERY_TO_RUN)
query1DiabetesResults <- #put the code to run the query here
# and let's print out the first few results to make sure it looks like we'd expect:
# Syntax: head(THING_TO_PRINT)
This time we're creating another query, but with slightly more complex constraints. We're looking for genes that are in the public HumanMine list PL_Pax6_Targets
, that are also expressed in the pancreas at a High
or Medium
level.
We'll need a few more constraints than we did in Query 1:
Gene
s should be IN
the list PL_Pax6_Targets
Gene.proteinAtlasExpression.level
should be set to High
OR Medium
. This will require two constraints, one for each of medium and high.Gene.proteinAtlasExpression.tissue.name
should be equal to Pancreas
We'd also like to see a few more columns this time:
Gene
's primaryIdentifier
and symbol
Gene.proteinAtlasExpression.cellType
Gene.proteinAtlasExpression.level
Gene.proteinAtlasExpression.tissue.name
Store it in a variable called query2UpInPancreas
# We don't want to see *all* genes and their expression.
# Let's narrow it down a little by constraining it to genes that are
# of interest, using setConstraints.
# Syntax:
#
# setConstraints(
# paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
# operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
# values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
# )
query2UpInPancreasConstraint <- setConstraints(
)
Excellent - we've defined the constraints we want. We still need to choose which columns to view.
# Create a new query.
# Syntax:
# newQuery(
# view = c("VIEW.NAME.1", "VIEW.NAME.2"),
# constraintLogic = "X and (Y or Z)" #constraintLogic is optional.
# )
query2UpInPancreas = newQuery(
# Choose which columns of data we'd like to see
view = c(
),
# set the logic for constraints. This means our pancreas expression level
# is EITHER Medium (B) or High (C), but not both.
# --
# Note: Constraint logic only needs to be set if you wish to use OR. All other
# constraints have AND logic applied by default.
constraintLogic = # Add your logic here
)
# Add the constraint to our expressed pancreas query (previously we just _defined_ the constraint)
# Syntax: QUERYNAME$where <- CONSTRAINT
Remember, that was just setting up the query - we haven't run it yet
# Now we have the query set up the way we want, let's actually *run* the query!
# Syntax: runQuery(im,QUERY_TO_RUN)
query2UpInPancreasResults
# Show me the first few results please!
head(query2UpInPancreasResults)
Let's check which genes are in BOTH lists that we've created. To do this, we'll strip down the columns we have to just primary identifiers, and then run a list intersect function.
"Gene.primaryIdentifier"
from both of your result lists - query1DiabetesResults
and query2UpInPancreasResults
.intersect(col1, col2)
on the columns you extracted# Extract the primaryIdentifier columns from query1 (diabetes genes) and query 2 (upexpressed in pancreas)
#
# Syntax to get an individual column view:
# data.frame[["COLUMN.NAME"]]
primaryIdentifiersDiabetes <- # put your code here
primaryIdentifiersPancreas <- # put your code here
# Find the intersection of the two vectors of primary identifiers, and store it in a variable
#
# Syntax: intersect(VECTOR_1, VECTOR_2)
# Show the results using the print(VARIABLE_TO_PRINT) method.
Finally, we fed the intersected list from above back into another query to see if there was any association of these genes with diabetes phenotypes according to GWAS studies. Note that we now start our query from the GWAS
class.
Start by making a constraint with the following values:
"GWAS.results.pValue"
is less than or equals (use the operator <=
) to "1e-04"
. (1e-04 is the same as 0.0001)."GWAS.results.phenotype"
must contain (operator CONTAINS
) the word diabetes
."GWAS.results.associatedGenes.primaryIdentifier"
should be equal to each of the values from your previous step. You'll need to repeat this constraint once per identifier.# First, we set up the constraints. The last three constraints are the
# diabetesAndPancreas result genes from our last query.
#
# Syntax:
#
# setConstraints(
# paths = c("CONSTRAINT.PATH.A", "CONSTRAINT.PATH.B"),
# operators = c("CONSTRAINT.OPERATOR.A", "CONSTRAINT.OPERATOR.B"),
# values = list("CONSTRAINT.VALUE.A","CONSTRAINT.VALUE.B")
# )
query3GWASConstraints <- setConstraints(
# put your constraints here
)
Now we've set our constraints up nicely, let's choose which columns we want to view.
query3GWAS <- newQuery(
# Quite a few columns this time!
view = c("GWAS.results.associatedGenes.primaryIdentifier",
"GWAS.results.associatedGenes.symbol", "GWAS.results.associatedGenes.name",
"GWAS.results.SNP.primaryIdentifier", "GWAS.results.pValue", "GWAS.results.phenotype",
"GWAS.firstAuthor", "GWAS.name", "GWAS.publication.pubMedId",
"GWAS.results.associatedGenes.organism.shortName"),
# set the logic for constraints. Remember that we want our results
# to include any one of the three genes we found in the list of diabetes+pancreas genes
# so we need to use some OR logic.
constraintLogic = #Fill the constraint logic in here. Remember the syntax: "A and (B or C)"
)
Add the constraints to the query, and then run it - we've filled this step in for you
#add constraint
query3GWAS$where <- query3GWASConstraints
#run query
query3GWASResults <- runQuery(im, query3GWAS)
Now, let's view those results...
query3GWASResults
And let's take a look at the unique gene symbols that were returned:
GWASIds <- query3GWASResults["GWAS.results.associatedGenes.symbol"]
unique(GWASIds)