Using data APIs with Python

The Very Hungry Caterpillar example

In [121]:

%matplotlib inline
from IPython.display import HTML
import matplotlib.pyplot as plt
import requests
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('white')
sns.set_context('talk', font_scale=1.2)

In [101]:

HTML('<blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">Hmm, I don&#39;t know about this caterpillar rearing manual. I thought P.rapae had an obligate association w/ Brassica. <a href="http://t.co/M10dqbOYlN">pic.twitter.com/M10dqbOYlN</a></p>&mdash; Christie Bahlai (@cbahlai) <a href="https://twitter.com/cbahlai/status/597462491166150656">May 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')

Out[101]:

Hmm, I don't know about this caterpillar rearing manual. I thought P.rapae had an obligate association w/ Brassica. pic.twitter.com/M10dqbOYlN
— Christie Bahlai (@cbahlai) May 10, 2015

In [102]:

HTML('<blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr">This is a terrible dataset about caterpillar diet. How did it got published? <a href="http://t.co/XkAq51HxEP">pic.twitter.com/XkAq51HxEP</a></p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/591041490618552320">April 23, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')

Out[102]:

This is a terrible dataset about caterpillar diet. How did it got published? pic.twitter.com/XkAq51HxEP
— Timothée Poisot (@tpoi) April 23, 2015

In [99]:

HTML('<blockquote class="twitter-tweet" data-partner="tweetdeck"><p lang="und" dir="ltr"><a href="https://twitter.com/tpoi">@tpoi</a> <a href="https://twitter.com/kara_woo">@kara_woo</a> <a href="https://twitter.com/cbahlai">@cbahlai</a> <a href="http://t.co/5lj9EzuKjW">pic.twitter.com/5lj9EzuKjW</a></p>&mdash; Yoav Ram (@yoavram) <a href="https://twitter.com/yoavram/status/597518650082365440">May 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')

Out[99]:

@tpoi @kara_woo @cbahlai pic.twitter.com/5lj9EzuKjW
— Yoav Ram (@yoavram) May 10, 2015

In [103]:

HTML('<blockquote class="twitter-tweet" data-partner="tweetdeck"><p lang="en" dir="ltr">[blog] How hungry are caterpillars anyway? <a href="http://t.co/SvImkHYHhR">http://t.co/SvImkHYHhR</a> <a href="https://twitter.com/hashtag/opendata?src=hash">#opendata</a></p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/597518409203589122">May 10, 2015</a></blockquote><script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>')

Out[103]:

[blog] How hungry are caterpillars anyway? http://t.co/SvImkHYHhR #opendata
— Timothée Poisot (@tpoi) May 10, 2015

We will learn how to use the Global Biotic Interactions (GloBI) API with Python to explore the question "How hungry are caterpillars anyway?" (sort of).

First, have a look at the API and the API docs. It is a RESTful API that returns responses in JSON format over HTTP.

HTTP: a protocol for transferring data, such as text files, over the internet.
JSON: a text-based data format whose syntax is very similar to a Python dict literal.
REST: a common convention for designing web applications that allow querying and retrieving (and sometimes creating, changing and deleting) data.
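To make the JSON/dict comparison concrete, here is a minimal sketch using Python's standard json module. The JSON text and its values are made up for illustration; they are not real API output.

```python
import json

# A small JSON document as text (illustrative values, not a real API response)
text = '{"columns": ["source", "target"], "data": [["Pieris rapae", "Brassica"]]}'

# json.loads parses JSON text: JSON objects become dicts, JSON arrays become lists
parsed = json.loads(text)
print(type(parsed).__name__)
print(parsed["data"][0][1])

# json.dumps goes the other way; Python's None is written as JSON null
as_text = json.dumps({"latitude": None})
print(as_text)
```

The main differences from a dict literal are small: JSON uses null, true and false where Python uses None, True and False, and JSON keys are always strings.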

Let's try it, following Poisot's lead on The Very Hungry Caterpillar.

[Image: caterpillar]

We will use requests - a Python HTTP library for humans.

In [109]:

response = requests.get("http://api.globalbioticinteractions.org/interaction?sourceTaxon=Pieris&interactionType=eats")
print("OK:", response.ok)

OK: True
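The URL in the request above carries two query parameters, sourceTaxon and interactionType. As a sketch of how such a query string is assembled, the example below builds the same URL from a dict using only the standard library. The helper name build_url is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.globalbioticinteractions.org/interaction"

def build_url(source_taxon, interaction_type):
    """Hypothetical helper: join the base URL with an encoded query string."""
    query = urlencode({
        "sourceTaxon": source_taxon,
        "interactionType": interaction_type,
    })
    return BASE_URL + "?" + query

url = build_url("Pieris", "eats")
print(url)
```

With requests there is no need to build the string by hand: passing the same dict as requests.get(BASE_URL, params=...) encodes the parameters for you, which makes it easier to change the taxon or the interaction type later.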

The response payload is in JSON format. Calling the json method parses the payload and, for this API, returns it as a dict:

In [110]:

payload = response.json()
print(len(payload))
print(payload.keys())

2
dict_keys(['columns', 'data'])

The response has two fields, columns and data, corresponding to the data frame's column names and rows. That's great because we can push it right into a pandas.DataFrame:

In [74]:

print(payload['columns'])

['source_taxon_external_id', 'source_taxon_name', 'source_taxon_path', 'source_specimen_life_stage', 'source_specimen_basis_of_record', 'interaction_type', 'target_taxon_external_id', 'target_taxon_name', 'target_taxon_path', 'target_specimen_life_stage', 'target_specimen_basis_of_record', 'latitude', 'longitude', 'study_title']

In [75]:

print(payload['data'][0])

['EOL:174006', 'Pieris marginalis', 'Animalia | Bilateria | Protostomia | Ecdysozoa | Arthropoda | Hexapoda | Insecta | Pterygota | Neoptera | Holometabola | Lepidoptera | Papilionoidea | Pieridae | Pierinae | Pierini | Pierina | Pieris | Pieris marginalis', None, None, 'eats', 'EOL:29914', 'Rubus', 'Plantae | Tracheophyta | Magnoliopsida | Rosales | Rosaceae | Rubus | Rubus status', None, None, None, None, None]
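Each row in data lines up, position by position, with the names in columns. As a small sketch of that correspondence, the example below pairs a shortened subset of the column names with the matching values from the first row shown above; the subset is chosen here only to keep the example brief.

```python
# A subset of the column names and the matching values from the first row
columns = ["source_taxon_name", "interaction_type", "target_taxon_name", "latitude"]
row = ["Pieris marginalis", "eats", "Rubus", None]

# zip pairs each column name with the value at the same position
record = dict(zip(columns, row))
print(record["source_taxon_name"], record["interaction_type"], record["target_taxon_name"])
```

This is the same pairing a data frame performs for every row at once when it is given the rows together with the column names.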

In [112]:

df = pd.DataFrame(payload['data'], columns=payload['columns'])
print(df.shape)
df.head()

(232, 14)

Out[112]:

	source_taxon_external_id	source_taxon_name	source_taxon_path	source_specimen_life_stage	source_specimen_basis_of_record	interaction_type	target_taxon_external_id	target_taxon_name	target_taxon_path	target_specimen_life_stage	target_specimen_basis_of_record	latitude	longitude	study_title
0	EOL:174006	Pieris marginalis	Animalia | Bilateria | Protostomia | Ecdysozoa...	None	None	eats	EOL:29914	Rubus	Plantae | Tracheophyta | Magnoliopsida | Rosal...	None	None	None	None	None
1	EOL:174006	Pieris marginalis	Animalia | Bilateria | Protostomia | Ecdysozoa...	None	None	eats	EOL:37457	Arabis	Plantae | Tracheophyta | Magnoliopsida | Brass...	None	None	None	None	None
2	EOL:174006	Pieris marginalis	Animalia | Bilateria | Protostomia | Ecdysozoa...	None	None	eats	EOL:37718	Rorippa	Plantae | Tracheophyta | Magnoliopsida | Brass...	None	None	None	None	None
3	EOL:174006	Pieris marginalis	Animalia | Bilateria | Protostomia | Ecdysozoa...	None	None	eats	EOL:37667	Cardamine	Plantae | Tracheophyta | Magnoliopsida | Brass...	None	None	None	None	None
4	EOL:176683	Pieris rapae	Animalia | Arthropoda | Insecta | Lepidoptera ...	None	None	eats	EOL:467679	Centaurea melitensis	Plantae | Tracheophyta | Magnoliopsida | Aster...	None	None	None	None	None

Let's see what each caterpillar eats. We requested only the eats interactions, so we can drop every column except the source and target taxon names:

In [113]:

cols = df.columns.tolist()
cols.remove('source_taxon_name')
cols.remove('target_taxon_name')
print(cols)

['source_taxon_external_id', 'source_taxon_path', 'source_specimen_life_stage', 'source_specimen_basis_of_record', 'interaction_type', 'target_taxon_external_id', 'target_taxon_path', 'target_specimen_life_stage', 'target_specimen_basis_of_record', 'latitude', 'longitude', 'study_title']

In [114]:

df.drop(labels=cols, axis=1, inplace=True)
df.head()

Out[114]:

	source_taxon_name	target_taxon_name
0	Pieris marginalis	Rubus
1	Pieris marginalis	Arabis
2	Pieris marginalis	Rorippa
3	Pieris marginalis	Cardamine
4	Pieris rapae	Centaurea melitensis
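An alternative to listing the columns to drop is to select the columns to keep. The sketch below shows this on a toy data frame; the frame and the variable names toy and pairs are made up for illustration, with values taken from the rows shown above.

```python
import pandas as pd

# Toy frame with one extra column, standing in for the full 14-column table
toy = pd.DataFrame({
    "source_taxon_name": ["Pieris marginalis", "Pieris rapae"],
    "interaction_type": ["eats", "eats"],
    "target_taxon_name": ["Rubus", "Centaurea melitensis"],
})

# Indexing with a list of names returns a new frame with only those columns
pairs = toy[["source_taxon_name", "target_taxon_name"]]
print(pairs)
```

Selecting returns a new data frame and leaves the original untouched, whereas drop with inplace=True modifies df itself; which one is preferable depends on whether the other columns are needed later.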

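Next, we count how many target taxa occur for each source taxon. For that, we group by source and aggregate by length, which counts the rows in each group. The row count equals the number of distinct targets only if each source-target pair appears exactly once. I checked that beforehand; how would you check it?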

The groupby made source_taxon_name become an index rather than a column and that's why we call reset_index.
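One possible way to check that each source-target pair appears only once is sketched below on toy data, where the last row is a deliberate duplicate. This is not necessarily how the check was done for this notebook; the frame and its values are illustrative.

```python
import pandas as pd

# Toy data standing in for the source/target table; the last row repeats the first
toy = pd.DataFrame({
    "source_taxon_name": ["Pieris rapae", "Pieris rapae", "Pieris napi", "Pieris rapae"],
    "target_taxon_name": ["Brassica", "Arabis", "Arabis", "Brassica"],
})

# duplicated() marks every row that repeats an earlier row
n_dup = int(toy.duplicated().sum())
print("duplicate rows:", n_dup)

# Counting distinct targets per source is safe even when duplicates exist
counts = toy.groupby("source_taxon_name")["target_taxon_name"].nunique()
print(counts)
```

If the number of duplicate rows is zero, counting rows per group and counting distinct targets per group give the same result.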

In [115]:

table = df.groupby(by='source_taxon_name').aggregate(len).reset_index()
table.head()

Out[115]:

	source_taxon_name	target_taxon_name
0	Pieris brassicae	55
1	Pieris brassicoides	3
2	Pieris canidia	10
3	Pieris cheiranthi	1
4	Pieris deota	1

Finally, we rename the columns to make them more meaningful and sort the table by the number of target taxa. Then we print and plot:

In [116]:

table = table.rename(columns={'source_taxon_name':'Pieris species', 'target_taxon_name': 'Number of known items in diet'})
table = table.sort_values('Number of known items in diet', ascending=False)
table

Out[116]:

	Pieris species	Number of known items in diet
12	Pieris rapae	91
0	Pieris brassicae	55
11	Pieris napi	51
2	Pieris canidia	10
13	Pieris virginiensis	6
8	Pieris marginalis	4
1	Pieris brassicoides	3
6	Pieris krueperi	3
5	Pieris ergane	2
7	Pieris mannii	2
10	Pieris naganum	2
3	Pieris cheiranthi	1
4	Pieris deota	1
9	Pieris melete	1

In [124]:

table.plot(x="Pieris species", y="Number of known items in diet", kind="barh", legend=False)
plt.xlabel('Number of known items in diet')  # in a horizontal bar plot the counts run along the x axis
plt.grid(False)
sns.despine()

[Image: Pieris rapae]

[Image: Pieris brassicae]