Intermine-Python: Tutorial 4: Query Features - Joins

In the previous tutorials we learnt about the basic backbone of a query - views and constraints. This short tutorial will talk about another feature of an intermine query - Outer and Inner Joins.

When we add a path to a query, even if it is in the view, then by default there is a constraint involved. Our query will consist of only those records that have information in the fields or attributes that are described by the path.

Let's say that we want to get any genes involved in a biosynthetic process and any publications on them. If a particular gene has the publication information available, then we want to view it but if it does not then we still want to view the general information about the Gene. By default, Intermine is designed to give you an Inner Join, which basically means that no partial matches will be part of the result. However, for the example that we are discussing over here, we would require something known as an Outer Join. An Outer Join on Gene.publications would help in solving the problem for us. The code for the same is given below.

We begin as usual by importing the Service module and creating a Query object.

In [1]:
from intermine.webservice import Service
In [2]:
service = Service("https://www.flymine.org/flymine/service")
query=service.new_query("Gene")

We then select the columns that we want in our final output and add the constraint.

In [3]:
query.select("primaryIdentifier","symbol","publications.year","publications.firstAuthor","publications.title")
Out[3]:
<intermine.query.Query at 0x7f1a76794b70>
In [4]:
query.add_constraint("ontologyAnnotations.ontologyTerm.name","=","*biosynthetic process*")
Out[4]:
<BinaryConstraint: Gene.ontologyAnnotations.ontologyTerm.name = *biosynthetic process*>

And finally, we add an outer join.

In [5]:
query.outerjoin("publications")
Out[5]:
<intermine.query.Query at 0x7f1a76794b70>
In [6]:
for row in query.rows(size=10):
    print(row)
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2010 publications.firstAuthor='Schnorrer Frank' publications.title='Systematic genetic analysis of muscle morphogenesis and function in Drosophila.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2010 publications.firstAuthor='Neely G Gregory' publications.title='A genome-wide Drosophila screen for heat nociception identifies α2δ3 as an evolutionarily conserved pain gene.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2012 publications.firstAuthor='Comeron Josep M' publications.title='The many landscapes of recombination in Drosophila melanogaster.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2015 publications.firstAuthor='Nitta Kazuhiro R' publications.title='Conservation of transcription factor binding specificities across 600 million years of bilateria evolution.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2005 publications.firstAuthor='Hoskins Roger A' publications.title='Rapid and efficient cDNA library screening by self-ligation of inverse PCR products (SLIP).'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2010 publications.firstAuthor='Neely G Gregory' publications.title='A global in vivo Drosophila RNAi screen identifies NOT3 as a conserved regulator of heart function.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2001 publications.firstAuthor='Benos P V' publications.title='From first base: the sequence of the tip of the X chromosome of Drosophila melanogaster, a comparison of two sequencing strategies.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2009 publications.firstAuthor='Wheeler Scott R' publications.title='MidExDB: a database of Drosophila CNS midline cell gene expression.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=None publications.firstAuthor='Lefevre G' publications.title='The distribution of randomly recovered X-ray-induced sex-linked genetic effects in Drosophila melanogaster.'
Gene: primaryIdentifier='FBgn0000022' symbol='ac' publications.year=2002 publications.firstAuthor='Ernstrom Glen G' publications.title='Genetics of sensory mechanotransduction.'

Another query feature that Intermine has is the ability to define shorter column names. This can be done using the add_path_description method. I'll show you a short example.

In [7]:
query.add_view("ontologyAnnotations.ontologyTerm.name")
Out[7]:
<intermine.query.Query at 0x7f1a76794b70>
In [8]:
query.add_path_description("ontologyAnnotations.ontologyTerm","Ontology Term")
Out[8]:
<PathDescription: Gene.ontologyAnnotations.ontologyTerm>
In [9]:
query.add_path_description("publications","Pub.")
Out[9]:
<PathDescription: Gene.publications>

This helps us when we want to print our tables into a file and want the column names to be in a more readable format.

In the next tutorial we will look at dealing with the results that are generated.