UNdata Informal API

The UNdata website offers an official API but it doesn't look overly welcoming to someone not versed in the XML protocol it supports. So here's a hacked solution based on scraping a websearch that let's you search the site for datasets, and then download the one you want as a zipped CSV file that gets automatically parsed into a pandas dataframe.

The UN data search form lets you download data directly from the results page:

UN Data search form

So let's write a simple scraper to grab the results and see if you can download a selected ata file automatically...

If we View Source on the results page we can look for the individual results items - and see what we neeed to parse out.

UN data search results - view source

We also need to have a look at what form the HTTP request for a data download looks like to make sure we get what we need when we do scrape the results...

UN Data results page - with data download request details

In [2]:
#Load in some libraries to handle the web page requests and the web page parsing...
import requests

#You may need to install BeautifulSoup
#!pip3 install beautifulsoup4
from bs4 import BeautifulSoup

#Note - I'm in Python3
from urllib.parse import parse_qs

#The scraper will be limited to just the first results page...
def searchUNdata(q):
    ''' Run a search on the UN data website and scrape the results '''
    
    params={'q':q}
    url='http://data.un.org/Search.aspx'

    response = requests.get(url,params=params)

    soup=BeautifulSoup(response.content)

    results={}

    #Get the list of results
    searchresults=soup.findAll('div',{'class':'Result'})
    
    #For each result, parse out the name of the dataset, the datamart ID and the data filter ID
    for result in searchresults:
        h2=result.find('h2')
        #We can find everything we need in the <a> tag...
        a=h2.find('a')
        p=parse_qs(a.attrs['href'])
        results[a.text]=(p['d'][0],p['f'][0])

    return results
In [3]:
#A couple of helper functions to let us display the results

results=searchUNdata('carbon dioxide')

def printResults(results):
    ''' Nicely print the search results '''
    
    for result in results.keys():
        print(result)


def unDataSearch(q):
    ''' Simple function to take a search phrase, run the search on the UN data site, and print and return the results. '''
    
    results=searchUNdata(q)
    printResults(results)
    return results

printResults(results)

#q='carbon dioxide'
#unDataSearch(q)
Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)
Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)
Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)
Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)
Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)
Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)
Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)
Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope
In [4]:
#Just in case - a helper routine for working with the search results data
def search(d, substr):
    ''' Partial string match search within dict key names '''
    #via http://stackoverflow.com/a/10796050/454773
    
    result = []
    for key in d:
        if substr.lower() in key.lower():
            result.append((key, d[key])) 

    return result
In [5]:
search(results, 'per capita')
Out[5]:
[('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)',
  ('MDG', 'seriesRowID:751')),
 ('Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',
  ('MDG', 'seriesRowID:752'))]
In [6]:
#Note - I'm in Python3
from io import BytesIO

import zipfile
import pandas as pd

def getUNdata(undataSearchResults,dataset):
    ''' Download a named dataset from the UN Data website and load it into a pandas dataframe '''

    datamartID,seriesRowID=undataSearchResults[dataset]
    
    url='http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter='+seriesRowID+'&DataMartId='+datamartID+'&Format=csv'

    r = requests.get(url)
    
    
    s=BytesIO(r.content)
    z = zipfile.ZipFile(s)
    
    #Show the files in the zip file
    #z.namelist()
    
    #Let's assume we just get one file per zip...
    #Drop any all blank columns
    df=pd.read_csv( BytesIO( z.read( z.namelist()[0] ) )).dropna(axis=1,how='all')
    return df
In [7]:
results=unDataSearch('carbon dioxide')
Carbon dioxide emissions (CO2), metric tons of CO2 per capita (CDIAC)
Carbon dioxide emissions (CO2), thousand metric tons of CO2 (UNFCCC)
Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (CDIAC)
Carbon dioxide (CO2) Emissions without Land Use, Land-Use Change and Forestry (LULUCF), in Gigagrams (Gg)
Carbon dioxide emissions (CO2), thousand metric tons of CO2 (CDIAC)
Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)
Carbon dioxide emissions (CO2), kg CO2 per $1 GDP (PPP) (UNFCCC)
Trade of goods , US$, HS 1992, 28 Inorganic chemicals, precious metal compound, isotope
In [13]:
dd=getUNdata(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')

#Preview the last few rows
dd[-5:]
Out[13]:
Country or Area Year Value Value Footnotes Value Footnotes.1
922 United States 1991 19.357277 1 1
923 United States 1990 19.746756 1 1
924 NaN NaN NaN NaN NaN
925 footnoteSeqID Footnote NaN NaN NaN
926 1 For Denmark, France, United Kingdom and United... NaN NaN NaN
In [14]:
#One thing to note is that footnotes may appear at the bottom of a dataframe
#We can spot the all empty row and drop rows from that
#We can also drop the footnote related columns
def dropFootnotes(df):
    return df[:pd.isnull(dd).all(1).nonzero()[0][0]].drop(['Value Footnotes','Value Footnotes.1'], 1)

dropFootnotes(dd)[-5:]
Out[14]:
Country or Area Year Value
919 United States 1994 19.903438
920 United States 1993 19.788616
921 United States 1992 19.568633
922 United States 1991 19.357277
923 United States 1990 19.746756
In [18]:
#Create a function that automatically drops the footnotes and any empty rows
def getUNdata2(undataSearchResults, dataset, footnotes=False):
    df=getUNdata(undataSearchResults, dataset)
    if footnotes:
        return df
    return dropFootnotes(df)
    
In [24]:
getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)')[-5:]
Out[24]:
Country or Area Year Value
919 United States 1994 19.903438
920 United States 1993 19.788616
921 United States 1992 19.568633
922 United States 1991 19.357277
923 United States 1990 19.746756
In [23]:
getUNdata2(results,'Carbon dioxide emissions (CO2), metric tons of CO2 per capita (UNFCCC)',footnotes=True)[-5:]
Out[23]:
Country or Area Year Value Value Footnotes Value Footnotes.1
922 United States 1991 19.357277 1 1
923 United States 1990 19.746756 1 1
924 NaN NaN NaN NaN NaN
925 footnoteSeqID Footnote NaN NaN NaN
926 1 For Denmark, France, United Kingdom and United... NaN NaN NaN

Summary

This notebook demonstrates a simple, informal scraper based API to the UN data website. Searches can be run on the UN data website to obtain a list of named datasets, and then a specified named dataset can be automatically downloaded into a pandas dataframe.