Using data APIs with Python

The Very Hungry Caterpillar example

In [121]:
%matplotlib inline
from IPython.display import HTML
import matplotlib.pyplot as plt
import requests
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('white')
sns.set_context('talk', font_scale=1.2)
In [101]:
HTML('<blockquote class="twitter-tweet" lang="he"><p lang="en" dir="ltr">Hmm, I don&#39;t know about this caterpillar rearing manual. I thought P.rapae had an obligate association w/ Brassica. <a href="http://t.co/M10dqbOYlN">pic.twitter.com/M10dqbOYlN</a></p>&mdash; Christie Bahlai (@cbahlai) <a href="https://twitter.com/cbahlai/status/597462491166150656">מאי 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')
Out[101]:
In [102]:
HTML('<blockquote class="twitter-tweet" lang="he"><p lang="en" dir="ltr">This is a terrible dataset about caterpillar diet. How did it got published? <a href="http://t.co/XkAq51HxEP">pic.twitter.com/XkAq51HxEP</a></p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/591041490618552320">אפריל 23, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')
Out[102]:
In [99]:
HTML('<blockquote class="twitter-tweet" data-partner="tweetdeck"><p lang="und" dir="ltr"><a href="https://twitter.com/tpoi">@tpoi</a> <a href="https://twitter.com/kara_woo">@kara_woo</a> <a href="https://twitter.com/cbahlai">@cbahlai</a> <a href="http://t.co/5lj9EzuKjW">pic.twitter.com/5lj9EzuKjW</a></p>&mdash; Yoav Ram (@yoavram) <a href="https://twitter.com/yoavram/status/597518650082365440">May 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')
In [103]:
HTML('<blockquote class="twitter-tweet" data-partner="tweetdeck"><p lang="en" dir="ltr">[blog] How hungry are caterpillars anyway? <a href="http://t.co/SvImkHYHhR">http://t.co/SvImkHYHhR</a> <a href="https://twitter.com/hashtag/opendata?src=hash">#opendata</a></p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/597518409203589122">May 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')
Out[103]:

We will learn how to use the Global Biotic Interactions (GloBI) API with Python to check "How hungry are caterpillars anyway?" (sort of).

First, have a look at the API and the API docs. It is a RESTful API that returns responses in JSON format over HTTP.

  • HTTP: the protocol for transferring documents and data over the web.
  • JSON: a text data format whose objects and arrays map closely onto Python's dict and list.
  • REST: a common convention for designing web applications that allow querying and retrieving (and sometimes creating, changing and deleting) data.
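
To make the JSON point concrete, here is a minimal sketch using only the standard library's json module. The JSON text is hand-written for illustration (it is not a real API response, and the column names are made up), but it has the same general shape as the responses we will see below.

```python
import json

# A hand-written JSON document (illustrative values, not real API output).
text = '{"columns": ["source", "target"], "data": [["Pieris rapae", "Brassica"], ["Pieris napi", null]]}'

# JSON object -> dict, JSON array -> list, JSON null -> None.
payload = json.loads(text)
print(type(payload).__name__)
print(payload["columns"])
print(payload["data"][1][1])

# Going the other way: a dict serialised back to JSON text.
print(json.dumps({"interactionType": "eats", "count": 2}))
```

The main differences from a Python literal are cosmetic: JSON uses null, true and false, and strings always take double quotes.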

Let's try it, following Poisot's lead on The Very Hungry Caterpillar.

[Image: caterpillar]

We will use requests - a Python HTTP library for humans.

In [109]:
response = requests.get("http://api.globalbioticinteractions.org/interaction?sourceTaxon=Pieris&interactionType=eats")
print("OK:", response.ok)
OK: True
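
The part of the URL after the ? is the query string, and it is where the request parameters live. As a small sketch (standard library only, no network access), this is how the same URL can be assembled from a dict of parameters and taken apart again. The variable names base_url and params are just for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://api.globalbioticinteractions.org/interaction"
params = {"sourceTaxon": "Pieris", "interactionType": "eats"}

# Encode the parameters as key=value pairs joined by "&".
url = base_url + "?" + urlencode(params)
print(url)

# Splitting the URL again recovers the endpoint path and the parameters.
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

requests can do this encoding for us: passing the dict as the params argument of requests.get builds the same kind of query string.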

The response payload is in JSON format. Calling the json method returns the payload as a dict:

In [110]:
payload = response.json()
print(len(payload))
print(payload.keys())
2
dict_keys(['columns', 'data'])

The response has two fields, columns and data, corresponding to the data frame's column names and rows. That's great because we can push it right into a pandas.DataFrame:

In [74]:
print(payload['columns'])
['source_taxon_external_id', 'source_taxon_name', 'source_taxon_path', 'source_specimen_life_stage', 'source_specimen_basis_of_record', 'interaction_type', 'target_taxon_external_id', 'target_taxon_name', 'target_taxon_path', 'target_specimen_life_stage', 'target_specimen_basis_of_record', 'latitude', 'longitude', 'study_title']
In [75]:
print(payload['data'][0])
['EOL:174006', 'Pieris marginalis', 'Animalia | Bilateria | Protostomia | Ecdysozoa | Arthropoda | Hexapoda | Insecta | Pterygota | Neoptera | Holometabola | Lepidoptera | Papilionoidea | Pieridae | Pierinae | Pierini | Pierina | Pieris | Pieris marginalis', None, None, 'eats', 'EOL:29914', 'Rubus', 'Plantae | Tracheophyta | Magnoliopsida | Rosales | Rosaceae | Rubus | Rubus status', None, None, None, None, None]
In [112]:
df = pd.DataFrame(payload['data'], columns=payload['columns'])
print(df.shape)
df.head()
(232, 14)
Out[112]:
source_taxon_external_id source_taxon_name source_taxon_path source_specimen_life_stage source_specimen_basis_of_record interaction_type target_taxon_external_id target_taxon_name target_taxon_path target_specimen_life_stage target_specimen_basis_of_record latitude longitude study_title
0 EOL:174006 Pieris marginalis Animalia | Bilateria | Protostomia | Ecdysozoa... None None eats EOL:29914 Rubus Plantae | Tracheophyta | Magnoliopsida | Rosal... None None None None None
1 EOL:174006 Pieris marginalis Animalia | Bilateria | Protostomia | Ecdysozoa... None None eats EOL:37457 Arabis Plantae | Tracheophyta | Magnoliopsida | Brass... None None None None None
2 EOL:174006 Pieris marginalis Animalia | Bilateria | Protostomia | Ecdysozoa... None None eats EOL:37718 Rorippa Plantae | Tracheophyta | Magnoliopsida | Brass... None None None None None
3 EOL:174006 Pieris marginalis Animalia | Bilateria | Protostomia | Ecdysozoa... None None eats EOL:37667 Cardamine Plantae | Tracheophyta | Magnoliopsida | Brass... None None None None None
4 EOL:176683 Pieris rapae Animalia | Arthropoda | Insecta | Lepidoptera ... None None eats EOL:467679 Centaurea melitensis Plantae | Tracheophyta | Magnoliopsida | Aster... None None None None None
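
To see why the columns/data layout maps so directly onto a table, here is a plain-Python sketch of the pairing a data frame constructor has to do. The sample dict is hand-written: it keeps only three of the fourteen columns, with values copied from the rows above.

```python
# A tiny hand-written payload in the same columns/data layout.
sample = {
    "columns": ["source_taxon_name", "interaction_type", "target_taxon_name"],
    "data": [
        ["Pieris marginalis", "eats", "Rubus"],
        ["Pieris marginalis", "eats", "Arabis"],
        ["Pieris rapae", "eats", "Centaurea melitensis"],
    ],
}

# Pair every row with the column names to get one dict per record.
records = [dict(zip(sample["columns"], row)) for row in sample["data"]]
for record in records:
    print(record["source_taxon_name"], "->", record["target_taxon_name"])
```

Each row is a list whose positions line up with the column names, so no further parsing is needed before building the table.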

Let's see what each caterpillar eats. We only requested the eats interactions, so let's keep just the source and target taxon names:

In [113]:
cols = df.columns.tolist()
cols.remove('source_taxon_name')
cols.remove('target_taxon_name')
print(cols)
['source_taxon_external_id', 'source_taxon_path', 'source_specimen_life_stage', 'source_specimen_basis_of_record', 'interaction_type', 'target_taxon_external_id', 'target_taxon_path', 'target_specimen_life_stage', 'target_specimen_basis_of_record', 'latitude', 'longitude', 'study_title']
In [114]:
df.drop(labels=cols, axis=1, inplace=True)
df.head()
Out[114]:
source_taxon_name target_taxon_name
0 Pieris marginalis Rubus
1 Pieris marginalis Arabis
2 Pieris marginalis Rorippa
3 Pieris marginalis Cardamine
4 Pieris rapae Centaurea melitensis

Next, we count how many target taxa occur for each source taxon. For that, we group by source and aggregate by length (I checked beforehand that each source-target pair appears only once. How would you check that?).

The groupby turns source_taxon_name into the index rather than a column, which is why we call reset_index.

In [115]:
table = df.groupby(by='source_taxon_name').aggregate(len).reset_index()
table.head()
Out[115]:
source_taxon_name target_taxon_name
0 Pieris brassicae 55
1 Pieris brassicoides 3
2 Pieris canidia 10
3 Pieris cheiranthi 1
4 Pieris deota 1
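
Counting rows per group only gives the number of distinct diet items if no source-target pair is repeated. One possible way to check that, sketched on a hypothetical miniature frame (the pairs below are made up, with one deliberate repeat), is to use duplicated, drop_duplicates and nunique:

```python
import pandas as pd

# Hypothetical miniature of the two-column frame, with one repeated pair.
pairs = pd.DataFrame({
    "source_taxon_name": ["Pieris rapae", "Pieris rapae", "Pieris rapae", "Pieris napi"],
    "target_taxon_name": ["Brassica", "Brassica", "Rubus", "Arabis"],
})

# Number of rows that repeat an earlier source-target pair.
n_repeats = int(pairs.duplicated().sum())
print("repeated pairs:", n_repeats)

# Count rows per source after removing the repeats ...
counts = pairs.drop_duplicates().groupby("source_taxon_name").size()
# ... or count distinct targets directly, which ignores repeats.
distinct = pairs.groupby("source_taxon_name")["target_taxon_name"].nunique()
print(counts.to_dict())
print(distinct.to_dict())
```

If the number of repeated pairs is zero, aggregating by length and counting distinct targets give the same table.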

Finally, we rename the columns to make them more meaningful and sort the table by the number of target taxa. Then we print and plot:

In [116]:
table = table.rename(columns={'source_taxon_name':'Pieris species', 'target_taxon_name': 'Number of known items in diet'})
table = table.sort_values('Number of known items in diet', ascending=False)  # DataFrame.sort was removed from pandas
table
Out[116]:
Pieris species Number of known items in diet
12 Pieris rapae 91
0 Pieris brassicae 55
11 Pieris napi 51
2 Pieris canidia 10
13 Pieris virginiensis 6
8 Pieris marginalis 4
1 Pieris brassicoides 3
6 Pieris krueperi 3
5 Pieris ergane 2
7 Pieris mannii 2
10 Pieris naganum 2
3 Pieris cheiranthi 1
4 Pieris deota 1
9 Pieris melete 1
In [124]:
table.plot(x="Pieris species", y="Number of known items in diet", kind="barh", legend=False)
plt.xlabel('Number of known items in diet')  # in a horizontal bar chart the counts run along the x axis
plt.grid(False)
sns.despine()

[Image: Pieris rapae]

[Image: Pieris brassicae]