Getting some top-level data from the DigitalNZ API

This notebook pokes around at the top level of DigitalNZ, mainly using facets.

See the API documentation for more detailed information.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them. When you hover over them an icon appears.
  • To run a code cell either click the icon, or click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, a * appears in the square brackets next to the cell. Once the cell has finished running, the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

In [1]:
import requests
import pandas as pd
import altair as alt
from IPython.display import display, HTML

Get yourself an API key and paste it between the quotes below.

In [ ]:
api_key = 'YOUR API KEY'
print('Your API key is: {}'.format(api_key))
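
If you'd rather not paste your key into the notebook, one alternative is to read it from an environment variable. This is only a sketch: the variable name DIGITALNZ_API_KEY and the mask_key helper are made up for illustration and are not part of the DigitalNZ API. Run it instead of the cell above, not as well as it.

```python
import os

def key_from_env(default, var_name='DIGITALNZ_API_KEY'):
    '''Return the key stored in an environment variable, or the default if it is unset.'''
    # DIGITALNZ_API_KEY is a hypothetical name chosen for this example
    return os.environ.get(var_name) or default

def mask_key(key, visible=4):
    '''Hide all but the last few characters of a key before displaying it.'''
    if len(key) <= visible:
        return '*' * len(key)
    return '*' * (len(key) - visible) + key[-visible:]

api_key = key_from_env('YOUR API KEY')
print('Your API key is: {}'.format(mask_key(api_key)))
```

Masking the key means it won't end up in a saved or shared copy of the notebook.
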
In [3]:
# Base url for queries
api_search_url = 'http://api.digitalnz.org/v3/records.json'

# Set up the query params (we'll change these later)
# Let's start with an empty text query to look at everything
def set_params():
    params = {
        'api_key': api_key,
        'text': ''
    }
    return params
In [4]:
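def get_data(params):
    '''
    Send a query to the API and return the JSON payload as a dict.
    Raises requests.HTTPError if the API responds with an error status.
    '''
    response = requests.get(api_search_url, params=params)
    # Fail early on 4xx/5xx responses instead of trying to parse an error page
    response.raise_for_status()
    return response.json()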
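
To see what a query looks like before sending it, you can build the URL by hand. The sketch below uses only the standard library, so the result should be close to what requests generates but is not guaranteed to match it exactly. The preview_url function is a helper invented for this example.

```python
from urllib.parse import urlencode

api_search_url = 'http://api.digitalnz.org/v3/records.json'

def preview_url(params):
    '''Build the query URL from a dict of parameters without sending a request.'''
    return '{}?{}'.format(api_search_url, urlencode(params))

# An empty text query, as used in set_params()
print(preview_url({'api_key': 'YOUR API KEY', 'text': ''}))
```

Pasting the printed URL into a browser (with a real key) is a quick way to inspect the raw JSON.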

Hello world!

In [6]:
# How many items are there?
params = set_params()
data = get_data(params)
print('There are {:,} items'.format(data['search']['result_count']))
There are 31,473,190 items

Items by century

In [8]:
params['facets'] = 'century'
data = get_data(params)
In [9]:
centuries = data['search']['facets']['century']
centuries_df = pd.Series(centuries).to_frame().reset_index()
centuries_df.columns = ['century', 'count']
centuries_df
Out[9]:
century count
0 1900 16956632
1 1800 11129185
2 2000 2317364
3 1700 3686
4 1600 1435
5 1500 355
6 1300 197
7 1400 168
8 8000 152
9 1200 95
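
The facets come back ordered by count rather than by date. Here's a small sketch, in plain Python, of putting them in chronological order and flagging any centuries that lie in the future. The counts are copied from the table above, and both function names are made up for this example. Treating a future century as suspect is an assumption about the metadata, not something the API tells us.

```python
from datetime import date

# Facet counts copied from the output above (JSON object keys arrive as strings)
centuries = {
    '1900': 16956632, '1800': 11129185, '2000': 2317364, '1700': 3686,
    '1600': 1435, '1500': 355, '1300': 197, '1400': 168, '8000': 152, '1200': 95
}

def chronological(facets):
    '''Return (century, count) pairs sorted from earliest to latest.'''
    return sorted((int(century), count) for century, count in facets.items())

def future_centuries(facets, this_year=None):
    '''List centuries that start after the given year (defaults to the current year).'''
    this_year = this_year or date.today().year
    return [century for century, _ in chronological(facets) if century > this_year]

for century, count in chronological(centuries):
    print('{}: {:,}'.format(century, count))
print('Worth a closer look:', future_centuries(centuries))
```
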
In [10]:
c1 = alt.Chart(centuries_df).mark_bar().encode(
    x = 'century:O',
    y = 'count:Q',
    tooltip = alt.Tooltip('count', format=',')
)
c2 = alt.Chart(centuries_df).mark_bar().encode(
    x = 'century:O',
    y = alt.Y('count:Q', 
          scale=alt.Scale(type='log')),
    tooltip = alt.Tooltip('count', format=',')
)
c1 | c2
Out[10]:

Items by decade

In [12]:
params['facets'] = 'decade'
params['facets_per_page'] = 25
data = get_data(params)
In [13]:
decades = data['search']['facets']['decade']
decades_df = pd.Series(decades).to_frame().reset_index()
decades_df.columns = ['decade', 'count']
decades_df.head()
Out[13]:
decade count
0 1900 6451880
1 1910 6160236
2 1890 4749153
3 1880 3654340
4 1870 1837913
In [14]:
alt.Chart(decades_df).mark_bar().encode(
    x = 'decade:O',
    y = 'count:Q',
    tooltip = alt.Tooltip('count', format=',')
)
Out[14]:

Top 25 collections

In [16]:
params['facets'] = 'display_collection'
params['facets_per_page'] = 26
data = get_data(params)
In [17]:
# Note that the facet is called 'primary_collection' in the results!
collections = data['search']['facets']['primary_collection']
collections_df = pd.Series(collections).to_frame().reset_index()
collections_df.columns = ['collection', 'count']
collections_df.head()
Out[17]:
collection count
0 Papers Past 26122891
1 Radio New Zealand 714178
2 iNaturalist NZ — Mātaki Taiao 503899
3 TAPUHI 319349
4 Auckland Libraries Heritage Images Collection 273689

Papers Past is so much bigger than anything else that we'll exclude it from the chart.

In [18]:
alt.Chart(collections_df[1:]).mark_bar().encode(
    x=alt.X('count:Q'),
    y=alt.Y('collection:N'),
    tooltip = alt.Tooltip('count', format=',')
)
Out[18]:
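
To put a number on how much Papers Past dominates, we can compare its count with the total from the first query. This sketch hard-codes the two figures shown earlier in this notebook and assumes each item is counted under a single display collection. The share function is just a helper for this example.

```python
total_items = 31473190   # result_count from the first query above
papers_past = 26122891   # Papers Past count from the collections facet above

def share(part, whole):
    '''Return part as a percentage of whole.'''
    return 100 * part / whole

print('Papers Past holds {:.1f}% of all items'.format(share(papers_past, total_items)))
```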

Create a dataset of all collections

In [20]:
more = True
all_collections = {}
params['facets'] = 'display_collection'
params['facets_per_page'] = 100
params['facets_page'] = 1
while more:
    data = get_data(params)
    facets = data['search']['facets']['primary_collection']
    if facets:
        all_collections.update(facets)
        params['facets_page'] += 1
    else:
        more = False
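
The loop above keeps asking for the next page of facets until an empty page comes back. The same pattern can be wrapped in a function, sketched here with a made-up stand-in for the API so it runs without a key. The names collect_facets and fake_pages are invented for this example, and max_pages is a safety stop I've added, not a limit set by DigitalNZ.

```python
def collect_facets(fetch_page, max_pages=50):
    '''
    Call fetch_page(page_number) for pages 1, 2, 3... and merge the results,
    stopping at the first empty page or after max_pages requests.
    '''
    collected = {}
    for page in range(1, max_pages + 1):
        facets = fetch_page(page)
        if not facets:
            break
        collected.update(facets)
    return collected

# Stand-in for the API: two pages of invented facets, then nothing
fake_pages = {
    1: {'Collection A': 30, 'Collection B': 20},
    2: {'Collection C': 10}
}
result = collect_facets(lambda page: fake_pages.get(page, {}))
print(result)
```

In this notebook, fetch_page could be a small function that sets params['facets_page'] to the page number and returns get_data(params)['search']['facets']['primary_collection'].
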
In [21]:
all_collections_df = pd.Series(all_collections).to_frame().reset_index()
all_collections_df.columns = ['collection', 'count']
all_collections_df.head()
Out[21]:
collection count
0 Papers Past 26122891
1 Radio New Zealand 714178
2 iNaturalist NZ — Mātaki Taiao 503899
3 TAPUHI 319349
4 Auckland Libraries Heritage Images Collection 273689
In [22]:
all_collections_df.to_csv('digitalnz_collections.csv', index=False)
display(HTML('<a href="digitalnz_collections.csv">Download CSV file</a>'))

Top 25 newspapers in Papers Past

In [23]:
params['facets'] = 'collection'
params['and[display_collection][]'] = 'Papers Past'
params['facets_per_page'] = 26
params['facets_page'] = 1
data = get_data(params)
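
Note that the 'and[display_collection][]' entry stays in params once it's been added, so every later query that reuses params is also limited to Papers Past. If you want to keep an unfiltered set of parameters around, one option is to add filters to a copy. This is a sketch only, and with_filter is a helper name invented for this example.

```python
def with_filter(params, field, value):
    '''Return a copy of params with an and-filter on field; the original dict is untouched.'''
    filtered = dict(params)
    filtered['and[{}][]'.format(field)] = value
    return filtered

base_params = {'api_key': 'YOUR API KEY', 'text': ''}
papers_past_params = with_filter(base_params, 'display_collection', 'Papers Past')

print(papers_past_params)
print(base_params)
```
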
In [24]:
newspapers = data['search']['facets']['collection']
newspapers_df = pd.Series(newspapers).to_frame().reset_index()
newspapers_df.columns = ['newspaper', 'count']
newspapers_df.head()
Out[24]:
newspaper count
0 Papers Past 26122891
1 Evening Post 3772939
2 Otago Daily Times 1583124
3 Wanganui Chronicle 1163212
4 Hawera & Normanby Star 1075326
In [25]:
alt.Chart(newspapers_df[1:]).mark_bar().encode(
    x=alt.X('count:Q'),
    y=alt.Y('newspaper:N'),
    tooltip = alt.Tooltip('count', format=',')
)
Out[25]:

All newspapers in Papers Past

In [26]:
more = True
all_newspapers = {}
params['facets'] = 'collection'
params['and[display_collection][]'] = 'Papers Past'
params['facets_per_page'] = 100
params['facets_page'] = 1
while more:
    data = get_data(params)
    facets = data['search']['facets']['collection']
    if facets:
        all_newspapers.update(facets)
        params['facets_page'] += 1
    else:
        more = False
In [27]:
all_newspapers_df = pd.Series(all_newspapers).to_frame().reset_index()
all_newspapers_df.columns = ['newspaper', 'count']
all_newspapers_df.head()
Out[27]:
newspaper count
0 Papers Past 26122891
1 Evening Post 3772939
2 Otago Daily Times 1583124
3 Wanganui Chronicle 1163212
4 Hawera & Normanby Star 1075326
In [28]:
all_newspapers_df[1:].to_csv('paperspast_newspapers.csv', index=False)
display(HTML('<a href="paperspast_newspapers.csv">Download CSV file</a>'))
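
Slicing with [1:] relies on the Papers Past total being the first row. A variation that removes the entry by name instead is sketched below on a few of the counts shown above. The drop_key function is made up for this example.

```python
def drop_key(facets, name):
    '''Return a copy of the facets dict without the named entry.'''
    return {key: count for key, count in facets.items() if key != name}

# A few values copied from the output above
newspapers = {
    'Papers Past': 26122891,
    'Evening Post': 3772939,
    'Otago Daily Times': 1583124
}
titles_only = drop_key(newspapers, 'Papers Past')
print(titles_only)
```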