Intermine-Python: Tutorial 1: The Basics of a Query

Welcome to the first intermine-python tutorial. Over a series of approximately 12 tutorials, we will go through the basics of writing Python code to query an InterMine database.

This tutorial introduces intermine-python queries and walks through writing your first one. To get started, install the package by running pip install intermine in a terminal. Once the package is installed, you are good to go!

We start by importing the Service class from InterMine's webservice module.

In [1]:
from intermine.webservice import Service

We first create a Service object that points at the FlyMine web service. The Service class has a method called "new_query" that creates a query object:

In [2]:
service = Service("https://www.flymine.org/flymine/service")
query = service.new_query()

A query object defines what we want to extract from the InterMine database. The first part of a query is referred to as the "views". The views define the output columns that we want in our result. Let's query the FlyMine database to extract the symbol, primaryIdentifier and length of all genes.

In [3]:
query.select("Gene.symbol", "Gene.primaryIdentifier", "Gene.length")
Out[3]:
<intermine.query.Query at 0x111badd60>

Now that we have added the output columns to our query, we can print the results. Here we ask for the first 10 rows only.

In [4]:
for row in query.rows(start=0, size=10):
    print(row)
Gene: symbol='0610005C13Rik' primaryIdentifier='MGI:1918911' length=None
Gene: symbol='0610006L08Rik' primaryIdentifier='MGI:1923503' length=None
Gene: symbol='0610008J02Rik' primaryIdentifier='MGI:1925547' length=None
Gene: symbol='0610009B22Rik' primaryIdentifier='MGI:1913300' length=None
Gene: symbol='0610009E02Rik' primaryIdentifier='MGI:3698435' length=None
Gene: symbol='0610009F21Rik' primaryIdentifier='MGI:1918921' length=None
Gene: symbol='0610009K14Rik' primaryIdentifier='MGI:1918931' length=None
Gene: symbol='0610009L18Rik' primaryIdentifier='MGI:1914088' length=None
Gene: symbol='0610010F05Rik' primaryIdentifier='MGI:1918925' length=None
Gene: symbol='0610010K14Rik' primaryIdentifier='MGI:1915609' length=None
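
The start and size arguments used above select a window of results, so larger result sets can be read one page at a time. The following is a minimal sketch of that idea, not part of the intermine package: iter_pages is a hypothetical helper, and FakeQuery is an offline stand-in that only mimics the rows(start=..., size=...) call so the example runs without a network connection.

```python
from typing import Any, Iterator, List


def iter_pages(query: Any, page_size: int = 10, max_pages: int = 3) -> Iterator[List[Any]]:
    """Yield successive pages of rows from a query-like object.

    `query` only needs a rows(start=..., size=...) method, like the
    query objects used in this tutorial.
    """
    for page_number in range(max_pages):
        start = page_number * page_size
        page = list(query.rows(start=start, size=page_size))
        if not page:
            return  # no rows left
        yield page
        if len(page) < page_size:
            return  # a short page means we reached the end


class FakeQuery:
    """Hypothetical offline stand-in for a real query object."""

    def __init__(self, items):
        self._items = list(items)

    def rows(self, start=0, size=None):
        end = None if size is None else start + size
        return iter(self._items[start:end])


# 25 fake rows read in pages of 10 give pages of 10, 10 and 5 rows.
for page in iter_pages(FakeQuery(range(25)), page_size=10, max_pages=5):
    print(len(page))
```

With the query built above, iter_pages(query, page_size=10, max_pages=2) would request rows 0 to 9 and then rows 10 to 19. The max_pages cap is a deliberate safety limit, since a query without constraints can match a very large number of rows.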

The same query can also be written by passing the root class, "Gene", to new_query. The view names are then given relative to that class.

In [5]:
query = service.new_query("Gene")
query.select("symbol", "primaryIdentifier", "length")
Out[5]:
<intermine.query.Query at 0x111ba21f0>
In [6]:
for row in query.rows(start=0, size=10):
    print(row)
Gene: symbol='0610005C13Rik' primaryIdentifier='MGI:1918911' length=None
Gene: symbol='0610006L08Rik' primaryIdentifier='MGI:1923503' length=None
Gene: symbol='0610008J02Rik' primaryIdentifier='MGI:1925547' length=None
Gene: symbol='0610009B22Rik' primaryIdentifier='MGI:1913300' length=None
Gene: symbol='0610009E02Rik' primaryIdentifier='MGI:3698435' length=None
Gene: symbol='0610009F21Rik' primaryIdentifier='MGI:1918921' length=None
Gene: symbol='0610009K14Rik' primaryIdentifier='MGI:1918931' length=None
Gene: symbol='0610009L18Rik' primaryIdentifier='MGI:1914088' length=None
Gene: symbol='0610010F05Rik' primaryIdentifier='MGI:1918925' length=None
Gene: symbol='0610010K14Rik' primaryIdentifier='MGI:1915609' length=None
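
Both forms name the same three columns: a short view name such as "symbol" is read relative to the root class "Gene". The snippet below is a plain-Python illustration of that naming convention only. qualify_views is a hypothetical helper written for this tutorial; it is not a function of the intermine package and does not show how the library resolves paths internally.

```python
from typing import Iterable, List


def qualify_views(root: str, fields: Iterable[str]) -> List[str]:
    """Prefix each field with the root class unless it already has it."""
    prefix = root + "."
    return [f if f.startswith(prefix) else prefix + f for f in fields]


short_form = qualify_views("Gene", ["symbol", "primaryIdentifier", "length"])
long_form = qualify_views(
    "Gene", ["Gene.symbol", "Gene.primaryIdentifier", "Gene.length"]
)

print(short_form)               # ['Gene.symbol', 'Gene.primaryIdentifier', 'Gene.length']
print(short_form == long_form)  # True
```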

Feel free to use whichever method you find more comfortable. Now, let us try to write a new query that returns all organisms in the database.

In [7]:
query2 = service.new_query()
In [8]:
query2.select("Organism.name")
Out[8]:
<intermine.query.Query at 0x111b911c0>

If we want to add another column to the output, we do not need to rewrite the query. We can use the add_view method instead.

In [9]:
query2.add_view("Organism.taxonId")
Out[9]:
<intermine.query.Query at 0x111b911c0>
In [10]:
for row in query2.rows(start=0, size=10):
    print(row)
Organism: name='Anopheles gambiae' taxonId='7165'
Organism: name='Caenorhabditis elegans' taxonId='6239'
Organism: name='Danio rerio' taxonId='7955'
Organism: name='Drosophila ananassae' taxonId='7217'
Organism: name='Drosophila erecta' taxonId='7220'
Organism: name='Drosophila grimshawi' taxonId='7222'
Organism: name='Drosophila melanogaster' taxonId='7227'
Organism: name='Drosophila mojavensis' taxonId='7230'
Organism: name='Drosophila persimilis' taxonId='7234'
Organism: name='Drosophila pseudoobscura' taxonId='7237'
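
Printing rows is handy for a quick look, but for further processing it is often easier to collect the values into plain dictionaries. The sketch below assumes that each row supports lookup by column name, as in row["name"]; please check this against the version of intermine you have installed. rows_to_dicts and FakeOrganismQuery are hypothetical names: the fake query returns ordinary dictionaries, filled with two rows copied from the output above, so that the example runs offline.

```python
from typing import Any, Dict, List, Sequence


def rows_to_dicts(query: Any, columns: Sequence[str], size: int = 10) -> List[Dict[str, Any]]:
    """Collect the chosen columns of the first `size` rows as dictionaries."""
    records = []
    for row in query.rows(start=0, size=size):
        # Assumption: rows can be indexed by column name.
        records.append({column: row[column] for column in columns})
    return records


class FakeOrganismQuery:
    """Hypothetical offline stand-in holding two rows from the tutorial output."""

    _rows = [
        {"name": "Anopheles gambiae", "taxonId": "7165"},
        {"name": "Caenorhabditis elegans", "taxonId": "6239"},
    ]

    def rows(self, start=0, size=None):
        end = None if size is None else start + size
        return iter(self._rows[start:end])


for record in rows_to_dicts(FakeOrganismQuery(), ["name", "taxonId"]):
    print(record["name"], record["taxonId"])
```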

By default, the results are sorted by the first column that you defined. To sort by a different column, use the add_sort_order method of the Query class. In the cell below we pass "Organism.name", which is already the first view, so the order of the output does not change.

In [11]:
query2.add_sort_order("Organism.name")
Out[11]:
<intermine.query.Query at 0x111b911c0>
In [12]:
for row in query2.rows(start=0, size=10):
    print(row)
Organism: name='Anopheles gambiae' taxonId='7165'
Organism: name='Caenorhabditis elegans' taxonId='6239'
Organism: name='Danio rerio' taxonId='7955'
Organism: name='Drosophila ananassae' taxonId='7217'
Organism: name='Drosophila erecta' taxonId='7220'
Organism: name='Drosophila grimshawi' taxonId='7222'
Organism: name='Drosophila melanogaster' taxonId='7227'
Organism: name='Drosophila mojavensis' taxonId='7230'
Organism: name='Drosophila persimilis' taxonId='7234'
Organism: name='Drosophila pseudoobscura' taxonId='7237'
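
To see a change in the order, we could sort by another view or reverse the direction. The sketch below assumes that add_sort_order accepts a direction such as "asc" or "desc" as its second argument; treat that as something to confirm in the intermine documentation. build_sorted_query is a hypothetical helper, and RecordingService and RecordingQuery are offline stand-ins that only record the calls they receive, so nothing here contacts FlyMine.

```python
from typing import Any, Sequence


def build_sorted_query(service: Any, views: Sequence[str], sort_path: str,
                       descending: bool = False) -> Any:
    """Create a query, set its views, and add one sort order."""
    query = service.new_query()
    query.select(*views)
    # Assumption: the direction is passed as the second argument.
    direction = "desc" if descending else "asc"
    query.add_sort_order(sort_path, direction)
    return query


class RecordingQuery:
    """Hypothetical stand-in that records views and sort orders."""

    def __init__(self):
        self.views = []
        self.sort_orders = []

    def select(self, *paths):
        self.views = list(paths)
        return self

    def add_sort_order(self, path, direction="asc"):
        self.sort_orders.append((path, direction))
        return self


class RecordingService:
    """Hypothetical stand-in for a Service object."""

    def new_query(self):
        return RecordingQuery()


demo = build_sorted_query(
    RecordingService(),
    ["Organism.name", "Organism.taxonId"],
    "Organism.taxonId",
    descending=True,
)
print(demo.views)        # ['Organism.name', 'Organism.taxonId']
print(demo.sort_orders)  # [('Organism.taxonId', 'desc')]
```

The helper builds a fresh query each time rather than adding a second sort order to an existing one, which keeps the resulting order easy to reason about.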

As you can see, we limited each result to 10 rows, because these queries would otherwise list all the genes or all the organisms in the database. You can change the size argument if you want to view more or fewer rows. Views, or output columns, are one part of a query. The second part is the constraints placed on it. We will take a look at adding constraints in our next tutorial.