#!/usr/bin/env python
# coding: utf-8

# # Exploring the Te Papa collection API
# 
# Te Papa has a [new collection API](https://data.tepapa.govt.nz/docs/index.html), so I thought I should have a poke around. This notebook is just a preliminary exploration — it's not intended as a tutorial or a guide. There may well be mistakes and misinterpretations. Nonetheless, it might help you get a feel for what's possible.
# 
# In the future I'll add notebooks focused on specific tasks, but for now we're just going to follow our noses and see where we end up.
# 
# If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

# 
# Some tips:
# 
# In[254]:


import requests
import pandas as pd
import altair as alt
from tqdm import tnrange
import re
from six import iteritems
from IPython.display import display, HTML

alt.renderers.enable('notebook')


# ## Get an API key
# 
# [Sign up here](https://data.tepapa.govt.nz/docs/register.html) for your very own API key.

# In[358]:


# Insert your API key between the quotes
api_key = ''

# If you don't have an API key yet, you can leave the above blank and we'll pick up a guest token below
print('Your API key is: {}'.format(api_key))


# In[359]:


search_endpoint = 'https://data.tepapa.govt.nz/collection/search'
object_endpoint = 'https://data.tepapa.govt.nz/collection/object'
endpoint = 'https://data.tepapa.govt.nz/collection/{}'

headers = {
    'x-api-key': api_key,
    'Accept': 'application/json'
}

# If no API key was supplied, request a guest token and use it instead
if not api_key:
    response = requests.get('https://data.tepapa.govt.nz/collection/search')
    data = response.json()
    guest_token = data['guestToken']
    headers['Authorization'] = 'Bearer {}'.format(guest_token)


# ## What will we search for?
# 
# Here I'm going to set a keyword that I'll use in my searches throughout this notebook. Feel free to change it to explore your own results. You can also set it to '\*' (an asterisk) to return everything.

# In[360]:


keyword = 'Chinese'


# ## Let's make our first API request!

# In[361]:


# Set our search parameters for passing to Requests
params = {
    'q': keyword
}


# In[362]:


# Send off the API request
# We need to supply the `headers` to authenticate our request with our key
response = requests.get(search_endpoint, headers=headers, params=params)
# Get the JSON result data
data = response.json()


# In[363]:


# Drill down to get the summary data from our search
data['_metadata']['resultset']


# In[261]:


print('Your search for "{}" using the Te Papa collection API returned {:,} results.'.format(keyword, data['_metadata']['resultset']['count']))


# ## What sorts of things are in our search results?
# 
# A really useful feature of the API is that you can ask for facets on lots of different fields. You've probably used the facets on the [Te Papa collection search page](https://collections.tepapa.govt.nz/) to narrow down your results. Using the API, you can go even further, using the facets to summarise your results from a variety of different angles.
# 
# Let's start by getting facets from the `type` field. To request facets you have to POST your query to the [search endpoint](https://data.tepapa.govt.nz/docs/resource_SearchResource.html). Fortunately the Python Requests library makes it really easy to create and submit POST requests. All you have to do is supply the name of the field you want facets for, and the number of facets to return. If you set `size` to `5`, you'll get the 5 most frequent values. According to some Elasticsearch docs I found, you should be able to get all the facets by setting `size` to `0` (i.e. zero), but I couldn't get this to work.
# 
# #### Create POST request data

# In[262]:


# This is the dictionary that provides the data for the POST request
# Here we're saying we want the 5 'types' with the most results
# You can change the size parameter as necessary
post_data = {
    'query': 'chinese',
    'facets': [
        {
            'field': 'type',
            'size': 5
        }
    ]
}


# #### Make the API request
# 
# Note that we're using the `post` method rather than `get`. If we supply the POST data using the `json` parameter, Requests takes care of all the tricky encoding issues.
# In[263]:


response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# #### Convert to a dataframe
# 
# Let's convert the results to a Pandas dataframe because Pandas is awesome and it'll make it easier to create charts.

# In[264]:


types_df = pd.DataFrame(list(data['facets']['type'].items()))
# Set column names
types_df.columns = ['Type', 'Count']


# In[265]:


# View the results
types_df


# #### Display the results as a bar chart

# In[266]:


alt.Chart(types_df).mark_bar().encode(
    y='Type:O',
    x='Count',
    tooltip=[alt.Tooltip('Type:O'), alt.Tooltip('Count')]
)


# ## What collections are the objects in?
# 
# One of the great things about the Te Papa API is the richness of the data and all the interrelations between things, people, places, and subjects. But this also makes it a bit of a challenge to understand how everything fits together. On the GitHub site there's a useful summary of the record structures used to represent the different types of things. Here, for example, is what an [object record](https://github.com/te-papa/collections-api/wiki/Collections-API-Object-Model#object) looks like. Using this as a guide we can start to dig down through the data.
# 
# Let's get an overview of the `objects` in our search results by using the `collection` facet.
# 
# #### Create the POST request data
# 
# This time we're using the `filters` parameter to limit our search to things that have the `type` of 'Object'. We're then getting facets on the `collection` field.
# 
# Instead of using `filters` we could include something like `type: Object` in the query string. I think this changes the way the result set is constructed, but I don't know if it affects the results returned.

# In[267]:


post_data = {
    'query': 'chinese',
    'filters': [{
        'field': 'type',
        'keyword': 'Object'
    }],
    'facets': [
        {'field': 'collection', 'size': 20}
    ]
}


# #### Make the API request and convert the results to a dataframe

# In[268]:


# Get the API response
response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# In[269]:


# Convert to a dataframe
objects_df = pd.DataFrame(list(data['facets']['collection'].items()))
objects_df.columns = ['Collection', 'Count']


# In[270]:


# View the results
objects_df


# #### Display the results as a bar chart

# In[271]:


alt.Chart(objects_df).mark_bar().encode(
    y='Collection:O',
    x='Count',
    tooltip=[alt.Tooltip('Collection:O'), alt.Tooltip('Count')]
)


# So there are **lots** of photos. Let's see what we can find out about them.

# ## When were the photos taken?
# 
# As well as examining categories, we can use facets to display the date range of the results. But what date, and where is it? If you look at the structure of an [object record](https://github.com/te-papa/collections-api/wiki/Collections-API-Object-Model#object), you'll see that the `production` field is actually a list of production 'events', each of which has a `createdDate` field. In order to get a list of facets for `createdDate` we have to use dot notation to move down through the record hierarchy — so the field for faceting is `production.createdDate`.
# 
# #### The POST data

# In[309]:


post_data = {
    'query': 'chinese',
    'filters': [{
        'field': 'collection',
        'keyword': 'Photography'
    }],
    'facets': [
        {'field': 'production.createdDate', 'size': 100}
    ]
}


# #### The API request and response

# In[310]:


# Get the API response
response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# When I tried to access the data from the `production.createdDate` facet I got an error. If we look at the fields returned in the facets we can see why.

# In[311]:


# Let's see what the facet data looks like
data['facets'].keys()


# Note that the `createdDate` facet actually returns two sets of facets — `production.createdDate.verbatim` provides ISO-formatted dates, while `production.createdDate.temporal` provides timestamps. We'll use the `verbatim` field.

# #### Convert the results to a dataframe and do some cleaning

# In[312]:


# Let's use the 'verbatim' dates
photos_df = pd.DataFrame(list(data['facets']['production.createdDate.verbatim'].items()))
photos_df.columns = ['Date', 'Count']
photos_df.head()


# You can see that the `Date` field is a mix of days and years. Let's create a new `Year` column and use it to group together the totals.

# In[315]:


# Create a Year column by slicing the first four characters from the Date
photos_df['Year'] = photos_df['Date'].str.slice(0, 4)
# Group by Year, summing the counts
years = photos_df.groupby([photos_df['Year']], as_index=False).sum()
years.head()


# #### Make a chart

# In[316]:


c1 = alt.Chart(years).mark_bar().encode(
    x='Year:T',
    y='Count',
    tooltip=[alt.Tooltip('Year:T', format='%Y'), 'Count']
).interactive()
c1


# Hmmm, the values for 1957 and 1970 are a bit extraordinary. I wonder what's going on...?

# ## What happened in 1970?
# 
# I'd hoped to find out what happened in 1970 by limiting the results to those with a `createdDate` of '1970-01-01'. I first tried this query using `filters` to set the value for `production.createdDate`. However, the request returned an error that said the field wasn't facetable. I then tried adding `production.createdDate:"1970-01-01"` to the query string, but then I got no results at all. Eventually I found this [in the docs](https://github.com/te-papa/collections-api/wiki/Getting-started):
# 
# > Field search is not possible against nested fields, for example collection:Art is possible, but not production:mccahon or production.contributor.title:mccahon (however all nested text is searchable in general searches)
# 
# So I think I'll need to harvest all the photographs data and then explore offline. That'll have to wait...

# ## What are the photos about?
# 
# Let's try another approach. The `depicts` field provides a list of subjects (which I think can themselves be people, places, or categories). By asking for facets from the `depicts` field, we can get a picture of what the photos are about.
# 
# At first I tried getting facets for `depicts.title`, but this didn't work as `title` is a text field. After a bit of trial and error, I realised that asking for facets on `depicts.href` produced useful results. The `href` field is the API link to the full record for the category, so not only does it give us facets, it also provides a link to get more information.
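# 
# Just to show what I mean, here's a quick sketch of how you could look up one of those `href` values once we've retrieved the facets below. It's only a sketch, but the linked record should include a human-readable `title` along with lots of other data.

# In[ ]:


# Paste an `href` value from the `depicts.href` facet results between the quotes to try this out
sample_href = ''
if sample_href:
    sample_record = requests.get(sample_href, headers=headers).json()
    # The linked category record includes a 'title' field we can display
    print(sample_record['title'])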
# #### The POST data

# In[278]:


post_data = {
    'query': 'chinese',
    'filters': [{
        'field': 'collection',
        'keyword': 'Photography'
    }],
    'facets': [
        {'field': 'depicts.href', 'size': 10}
    ]
}


# #### The API response

# In[279]:


# Get the API response
response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# #### Convert the results to a dataframe

# In[280]:


depicts_df = pd.DataFrame(list(data['facets']['depicts.href'].items()))
depicts_df.columns = ['Category', 'Count']
depicts_df.head()


# #### Add category titles to the dataframe
# 
# So the `href` field by itself isn't very illuminating. But by looking it up using a simple GET request we get lots more data, including the category title.
# 
# I thought this could be useful later on, so I created a simple function.

# In[281]:


def get_categories(df):
    '''
    Look up category urls and get back the title to add to the dataframe.
    '''
    for i in tnrange(len(df)):
        href = df.loc[i]['Category']
        response = requests.get(href, headers=headers)
        title = response.json()['title']
        df.at[i, 'Title'] = title
    return df

depicts_df = get_categories(depicts_df)
depicts_df


# #### Make a chart
# 
# Now we have the titles, let's make a bar chart.

# In[282]:


alt.Chart(depicts_df).mark_bar().encode(
    y='Title:O',
    x='Count',
    tooltip=['Count']
)


# Hmmm... Of course we should remember that these are only the top 10 facets — we might want to expand the results. But already we can see a few oddities. For example, there are separate entries for 'Men' and 'men'!
# 
# Perhaps more interestingly, the most cited category in our search for 'Chinese' amongst photos in the Te Papa collection is 'Japanese'. That's weird...

# ## Why do so many photographs in my search for 'Chinese' have the category 'Japanese'?
# 
# Let's see if we can find out what's going on. First of all, let's try to limit our results to those that cite the 'Japanese' category. Filtering on the category `href` value seems to work.

# #### Get the 'Japanese' category href value

# In[283]:


href = depicts_df.loc[depicts_df['Title'] == 'Japanese']['Category'].values[0]
href


# #### The POST data

# In[284]:


post_data = {
    'query': 'chinese',
    'filters': [
        {
            'field': 'collection',
            'keyword': 'Photography'
        },
        {
            'field': 'depicts.href',
            'keyword': href
        }
    ]
}


# #### The API response

# In[285]:


# Get the API response
response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# #### View the titles
# 
# Let's just loop through the results and list the titles.

# In[286]:


for result in data['results']:
    print(result['title'])


# Hmmm... OK, so we can see why they have the 'Japanese' category attached, but why do they come up in a search for 'Chinese'?

# #### Finding the references to our keyword
# 
# From the raw results it's pretty hard to see why these photos are appearing in our search. The function below loops through all the nested records in the data looking for occurrences of our keyword.

# In[287]:


def find_fields(doc, keyword):
    '''
    Find fields that contain the given keyword.
    Return the name of the field and the parent object.
    '''
    if isinstance(doc, list):
        for d in doc:
            for result in find_fields(d, keyword):
                yield result
    if isinstance(doc, dict):
        for k, v in iteritems(doc):
            if isinstance(v, str) and keyword in v:
                yield [doc, k]
            elif isinstance(v, dict):
                for result in find_fields(v, keyword):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in find_fields(d, keyword):
                        yield result

fields = list(find_fields(data['results'], keyword))


# Now let's list the results, displaying the type of record the keyword appears in (Object, Place, Category etc), the title of the record, and the context in which the keyword appears.

# In[288]:


for field in fields:
    print('\n{} -- {}'.format(field[0]['type'], field[0]['title']))
    # Grab up to 40 characters either side of the keyword for context
    context = re.search('(.{{0,40}}{}.{{0,40}})'.format(keyword), field[0][field[1]]).group(1)
    # Use ANSI escape codes to highlight the keyword with a yellow background
    print(context.replace(keyword, '\33[43m{}\033[0m'.format(keyword)))


# So what's going on? From the list above you can see that the reference to 'Chinese' comes from a linked record for the `Place` 'Japan'. The default `query` search looks in all nested text fields, including the `scopeNote` of linked `Place` records, which is where the text above comes from.
# 
# On the one hand it's great that the default search looks in all the nested records. But on the other hand it's a bit annoying, because if we want to do anything with the data we'll have to weed out the irrelevant photos. It's a familiar trade-off between discoverability and accuracy. In a web interface it's good to include as much as possible and then relevance rank it in a sensible way. This gives users their best chance of finding what they're after. But it's not so good if you're using an API to assemble a dataset for further analysis. In that case you want to be able to set fairly firm boundaries around your results.

# ## Can we exclude results citing the 'Japanese' category?
# 
# Is there a way of excluding categories from our results? Well, sort of... If we try to filter by the `depicts.href` field we run into the same problem with nested field searching as we did with the dates. But unlike a date string, the category `href` value is pretty specific, so we could probably just throw it into the `query` string.
# 
# #### Get the 'Japanese' category href value

# In[289]:


href = depicts_df.loc[depicts_df['Title'] == 'Japanese']['Category'].values[0]
href


# #### The POST data

# In[290]:


post_data = {
    'query': 'chinese -"{}"'.format(href),
    'filters': [
        {
            'field': 'collection',
            'keyword': 'Photography'
        }
    ],
    'facets': [
        {'field': 'depicts.href', 'size': 10}
    ]
}


# #### The API response

# In[291]:


# Get the API response
response = requests.post(search_endpoint, json=post_data, headers=headers)
data = response.json()


# #### How many results do we have now?

# In[292]:


data['_metadata']['resultset']


# If we look above, we can see we started with 8,920 photos. The 'Japanese' category had 474 results. So I expected we'd have:
# 
# 8,920 - 474 = 8,446 results
# 
# One off...

# ## What's next?
# 
# I've had a go at [making maps](Mapping-Te-Papa-collections.ipynb) from some other facets. I think next I want to try harvesting out significant amounts of data. Stay tuned...

# In[ ]:
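
# Before I go, here's a very rough sketch of what that harvesting step might look like. This is an untested outline only. In particular, I'm assuming that the search endpoint accepts Elasticsearch-style 'from' and 'size' paging parameters in the POST data, so check the [API docs](https://data.tepapa.govt.nz/docs/index.html) for the actual parameter names and any page-size limits before relying on it.

# In[ ]:


def harvest_results(query, filters=None, page_size=100, max_records=1000):
    '''
    Rough sketch of a paged harvester.
    Assumes the search endpoint accepts 'from' and 'size' in the POST data --
    adjust these to match whatever the API docs actually specify.
    '''
    harvested = []
    start = 0
    while start < max_records:
        post_data = {
            'query': query,
            'from': start,
            'size': page_size
        }
        if filters:
            post_data['filters'] = filters
        response = requests.post(search_endpoint, json=post_data, headers=headers)
        results = response.json().get('results', [])
        if not results:
            # No more results (or the paging parameters weren't recognised)
            break
        harvested.extend(results)
        start += page_size
    return harvested


# For example, to grab up to 1000 of the photography results:
# photos = harvest_results('chinese', filters=[{'field': 'collection', 'keyword': 'Photography'}])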