Explore places associated with collection objects

In this notebook we'll explore the spatial dimensions of the object data. Where were objects created or collected? To do that we'll extract the nested spatial data, see what's there, and create a few maps.

See here for an introduction to the object data, and here to explore objects over time.

If you haven't already, you'll either need to harvest the object data, or unzip a pre-harvested dataset.
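If you're working from the pre-harvested dataset, something like the sketch below will unzip it into the current directory (the filename nma_object_db.zip is an assumption; use whatever your download is actually called).

import zipfile

# Assumed filename for the pre-harvested dataset -- adjust to match your download
with zipfile.ZipFile('nma_object_db.zip', 'r') as zipped:
    zipped.extractall()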

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell, click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running, an asterisk (*) appears in the square brackets next to the cell. Once the cell has finished running, the asterisk is replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

Import what we need

In [1]:
import pandas as pd
from ipyleaflet import Map, Marker, Popup, MarkerCluster, basemap_to_tiles, CircleMarker
import ipywidgets as widgets
from tinydb import TinyDB, Query
import reverse_geocode
from pandas import json_normalize
import altair as alt
from IPython.display import display, HTML, FileLink
from vega_datasets import data as vega_data
import country_converter as coco

Load the harvested data

In [2]:
# Get the JSON data from the local db
db = TinyDB('nma_object_db.json')
records = db.all()
Object = Query()
In [3]:
# Convert to a dataframe
df = pd.DataFrame(records)

How many different places are referred to in object records?

Places are linked to object records through the spatial field. One object record can be linked to multiple places. Let's get a list of all the places linked to object records.

First we'll use json_normalize() to extract the nested lists in the spatial field, creating one row for every linked place.

In [4]:
df_places = json_normalize(df.loc[df['spatial'].notnull()].to_dict('records'), record_path='spatial', meta=['id'], record_prefix='spatial_')
df_places.head()
Out[4]:
spatial_id spatial_type spatial_title spatial_roleName spatial_interactionType spatial_geo spatial_description id
0 693 Place Ernabella, South Australia, Australia Place made Production -26.2642,132.176 NaN 124081
1 1187 Place Mandiupi NaN Production NaN NaN 20174
2 333 Place Broken Hill, New South Wales, Australia Place collected NaN NaN NaN 188741
3 4600 Place Canbbage Tree Island Public School, Cabbage Tr... Place created Production -28.9842,153.457 NaN 42084
4 80126 Place France Associated place NaN NaN NaN 148323

This list will include many duplicates, as more than one object can be linked to a particular place. Let's drop duplicates based on the spatial_id and count how many unique places there are.

In [5]:
df_places.drop_duplicates(subset=['spatial_id']).shape[0]
Out[5]:
3336
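Before we put them on a map, it can be useful to see which places are linked to the most objects. A quick sketch using value_counts() on the un-deduplicated list gives a rough ranking:

# Count how many object links each place title has (before dropping duplicates)
df_places['spatial_title'].value_counts().head(10)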

Let's put the places on a map. First, we'll filter the records to show only those that have geo-coordinates, and then remove the duplicates as before.

In [6]:
df_places_with_geo = df_places.loc[df_places['spatial_geo'].notnull()].drop_duplicates(subset=['spatial_id'])

You might have noticed that the spatial_geo field contains the latitude and longitude, separated by a comma. Let's split the coordinates into separate columns.

In [7]:
df_places_with_geo[['lat', 'lon']] = df_places_with_geo['spatial_geo'].str.split(',', expand=True)
df_places_with_geo.head()
Out[7]:
spatial_id spatial_type spatial_title spatial_roleName spatial_interactionType spatial_geo spatial_description id lat lon
0 693 Place Ernabella, South Australia, Australia Place made Production -26.2642,132.176 NaN 124081 -26.2642 132.176
3 4600 Place Canbbage Tree Island Public School, Cabbage Tr... Place created Production -28.9842,153.457 NaN 42084 -28.9842 153.457
11 80019 Place Central Australia, Northern Territory, Australia NaN Production -24.3617,133.735 NaN 6840 -24.3617 133.735
15 4329 Place Mount Hagen, Western Highlands Province, Papua... Place made Production -5.8581,144.243 NaN 203086 -5.8581 144.243
26 1883 Place Tasmania, Australia NaN Production -41.9253,146.497 NaN 70975 -41.9253 146.497

Ok, let's make a map!

In [8]:
# This loads the country boundaries data
countries = alt.topo_feature(vega_data.world_110m.url, feature='countries')

# First we'll create the world map using the boundaries
background = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Then we'll plot the positions of places using circles
points = alt.Chart(df_places_with_geo).mark_circle(
    
    # Style the circles
    size=10,
    color='steelblue'
).encode(
    
    # Provide the coordinates
    longitude='lon:Q',
    latitude='lat:Q',
    
    # More info on hover
    tooltip=[alt.Tooltip('spatial_title', title='Place')]
).properties(width=700)

# Finally we layer the plotted points on top of the background map
alt.layer(background, points)
Out[8]:

What's missing?

In order to put the places on a map, we filtered out places that didn't have geo-coordinates. But how many of the linked places have coordinates?

In [9]:
'{:.2%} of linked places have geo-coordinates'.format(df_places_with_geo.shape[0] / df_places.drop_duplicates(subset=['spatial_id']).shape[0])
Out[9]:
'32.22% of linked places have geo-coordinates'

Hmmm, so the majority of linked places are actually missing from our map. Let's dig a bit deeper into the spatial records to see if we can work out why only some records have geo-coordinates.

Relationships to places

The relationships between places and objects are described in the spatial_roleName column. Let's see what's in there.

In [10]:
df_places['spatial_roleName'].value_counts()
Out[10]:
Place collected                   14604
Associated place                  12293
Place made                         5516
Place depicted                     4037
Place of event                     2158
Place created                      1858
Place used                         1828
Place of use                       1554
Subject                            1298
Place of publication               1276
Place printed                       614
Place of issue                      605
Place of production                 576
Place photographed                  488
Place worn                          364
Place compiled                      205
Place written                       174
Content created                     147
Place designed                      122
Place of execution                  113
Place Made                           66
Place purchased                      57
Place of restoration                 13
Place of component manufacture       10
Place of Origin                       9
place made                            6
Associated Place                      5
Place assembled                       5
Place of conversion                   4
Place of death                        4
Place of birth                        2
Place Collected                       2
Place of Publication                  1
Place of Execution                    1
Place of Use                          1
place of Publication                  1
Name: spatial_roleName, dtype: int64

As you can see, there are quite a few variations in format and capitalisation, which makes the values hard to aggregate. Fortunately the NMA has already applied some normalisation, grouping together all of the relationships that relate to creation or production. These are identified by the value 'Production' in the interactionType field. Let's see which of the roleName values are aggregated by interactionType.

In [11]:
df_places.loc[(df_places['spatial_interactionType'] == 'Production')]['spatial_roleName'].value_counts()
Out[11]:
Place made                        5516
Place created                     1858
Place of publication              1276
Place printed                      614
Place of production                576
Place of issue                     543
Place photographed                 488
Place compiled                     205
Place written                      174
Content created                    147
Place designed                     122
Place of execution                 113
Place Made                          66
Place of restoration                13
Place of component manufacture      10
place made                           6
Place assembled                      5
Place of conversion                  4
place of Publication                 1
Place of Execution                   1
Place of Publication                 1
Name: spatial_roleName, dtype: int64

How many of the places relate to production?

In [12]:
df_places.loc[(df_places['spatial_interactionType'] == 'Production')].shape[0]
Out[12]:
16831

Looking at the numbers above, you might notice that the counts by roleName don't add up to the total number of records with interactionType set to 'Production'. Let's check by finding the number of 'Production' records that have no roleName.

In [13]:
df_places.loc[(df_places['spatial_interactionType'] == 'Production') & (df_places['spatial_roleName'].isnull())].shape[0]
Out[13]:
5092

So, quite a lot of the places with a 'Production' relationship have no roleName. Let's look at a few.

In [14]:
df_places.loc[(df_places['spatial_interactionType'] == 'Production') & (df_places['spatial_roleName'].isnull())].head()
Out[14]:
spatial_id spatial_type spatial_title spatial_roleName spatial_interactionType spatial_geo spatial_description id
1 1187 Place Mandiupi NaN Production NaN NaN 20174
11 80019 Place Central Australia, Northern Territory, Australia NaN Production -24.3617,133.735 NaN 6840
24 621 Place Djinmalinjera, Northern Territory, Australia NaN Production NaN NaN 19877
26 1883 Place Tasmania, Australia NaN Production -41.9253,146.497 NaN 70975
37 20 Place Darwin, Northern Territory, Australia NaN Production -12.45,130.83 NaN 213694

Hmmm, that seems rather odd, but it shouldn't affect us too much. It just makes you wonder how those 'Production' values were set.

Created vs Collected

So according to the data above, it seems we have two major ways of categorising the relationships between places and objects. We can filter the roleName field by 'Place collected', or we can filter interactionType by 'Production'. Is there any overlap between these two groups?

In [15]:
# How many records have both an interactionType equal to 'Production' and a roleName equal to 'Place collected'?
df_places.loc[(df_places['spatial_interactionType'] == 'Production') & (df_places['spatial_roleName'] == 'Place collected')].shape[0]
Out[15]:
0

Nope, no overlap. These two groups don't capture all of the place relationships, but they do represent distinct types of relationships and are roughly equal in size. But before we start making more maps, let's see how many places in each group have geo-coordinates.
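Incidentally, neither group covers everything. If you're curious how many place links fall outside both, a quick count like this sketch will tell you (NaN values in either column count as 'not matching' here):

# Place links that are neither 'Production' nor 'Place collected'
df_places.loc[
    (df_places['spatial_interactionType'] != 'Production')
    & (df_places['spatial_roleName'] != 'Place collected')
].shape[0]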

First the 'created' places:

In [16]:
created_count = df_places.loc[(df_places['spatial_interactionType'] == 'Production')].shape[0]
created_geo_count = df_places.loc[(df_places['spatial_interactionType'] == 'Production') & (df_places['spatial_geo'].notnull())].shape[0]
print('{} of {} places ({:.2%}) with a "created" relationship have coordinates'.format(created_geo_count, created_count, created_geo_count / created_count))
16546 of 16831 places (98.31%) with a "created" relationship have coordinates

Now the 'collected' places:

In [17]:
collected_count = df_places.loc[(df_places['spatial_roleName'] == 'Place collected')].shape[0]
collected_geo_count = df_places.loc[(df_places['spatial_roleName'] == 'Place collected') & (df_places['spatial_geo'].notnull())].shape[0]
print('{} of {} places ({:.2%}) with a "collected" relationship have coordinates'.format(collected_geo_count, collected_count, collected_geo_count / collected_count))
0 of 14604 places (0.00%) with a "collected" relationship have coordinates

Ok... So in answer to our question above about what's missing, it seems that only places with a 'created' relationship have geo-coordinates. Let's see if we can fix that so we can map both 'created' and 'collected' records.

Enriching the place data

As well as the place data that's embedded in object records, the NMA provides access to all of the place records in its system. These can be harvested from the /place endpoint. Assuming that you've harvested all the place records, we can now use them to enrich the object records.

First we'll load all the places records.

In [18]:
db_places = TinyDB('nma_place_db.json')
place_records = db_places.all()
df_all_places = pd.DataFrame(place_records)

We're going to merge the place records with the object records, but before we do that, let's see if we can add information about country to the records.

The spatial_title field is a string that often (but not always) includes the country as well as the place name, but it doesn't seem like a reliable way of identifying countries. An alternative is to use the geo-coordinates. Through a process known as reverse geocoding, we can look up the country that contains a given set of coordinates. To do this we're going to use the reverse-geocode package.
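To get a feel for how the package behaves, you can try it on a single pair of coordinates first. This is just a sketch, using coordinates that are roughly Canberra's:

# reverse_geocode.search() takes a sequence of (latitude, longitude) tuples
# and returns a matching dictionary for each pair
reverse_geocode.search([(-35.28, 149.13)])[0]
# should return something like {'country_code': 'AU', 'city': 'Canberra', 'country': 'Australia'}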

In [23]:
def find_country(row):
    '''
    Use reverse-geocode to get country details for a set of coordinates.
    '''
    try:
        # reverse_geocode.search() expects a sequence of (lat, lon) tuples,
        # so the trailing comma below wraps the coordinates in an outer tuple
        coords = tuple([float(c) for c in row['geo'].split(',')]),
        location = reverse_geocode.search(coords)
        country = [location[0]['country_code'], location[0]['country']]
    except AttributeError:
        # This place has no 'geo' value, so there's nothing to look up
        country = []
    return pd.Series(country, dtype='object')

df_all_places[['country_code', 'country']] = df_all_places.apply(find_country, axis=1)

Did it work? Let's look at the country values.

In [24]:
df_all_places['country'].value_counts()
Out[24]:
Australia           2973
United Kingdom       185
United States        139
Papua New Guinea      77
Italy                 38
                    ... 
Cambodia               1
Afghanistan            1
Sudan                  1
Mozambique             1
                       1
Name: country, Length: 126, dtype: int64
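The unlabelled value at the bottom of the list suggests that at least one set of coordinates didn't resolve to a country name. If you want to inspect those rows, a sketch like this should surface them (assuming the empty label is an empty string rather than a missing value):

# Look at place records where reverse-geocoding returned an empty country name
df_all_places.loc[df_all_places['country'] == ''].head()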

Now it's time to merge the place data we extracted from the object records with the complete set of place records. By linking records on the place id, we can append the information from the place records to the object records.

In [25]:
# Merging on the place id in each dataframe -- in the objects data it's 'spatial_id', in the places it's just 'id'
df_places_merged = pd.merge(df_places, df_all_places, how='left', left_on='spatial_id', right_on='id')
df_places_merged.head()
Out[25]:
spatial_id spatial_type spatial_title spatial_roleName spatial_interactionType spatial_geo spatial_description id_x id_y type title geo country_code country
0 693 Place Ernabella, South Australia, Australia Place made Production -26.2642,132.176 NaN 124081 693 place Ernabella, South Australia, Australia -26.2642,132.176 AU Australia
1 1187 Place Mandiupi NaN Production NaN NaN 20174 1187 place Mandiupi NaN NaN NaN
2 333 Place Broken Hill, New South Wales, Australia Place collected NaN NaN NaN 188741 333 place Broken Hill, New South Wales, Australia -31.95,141.45 AU Australia
3 4600 Place Canbbage Tree Island Public School, Cabbage Tr... Place created Production -28.9842,153.457 NaN 42084 4600 place Canbbage Tree Island Public School, Cabbage Tr... -28.9842,153.457 AU Australia
4 80126 Place France Associated place NaN NaN NaN 148323 80126 place France 46.5592,2.2742 FR France

The point of this was to try to get geo-coordinates for more of the places in the object records. Let's see if it worked by repeating our check on 'collected' places. Note that the appended field is geo rather than spatial_geo.

In [26]:
collected_count = df_places_merged.loc[(df_places_merged['spatial_roleName'] == 'Place collected')].shape[0]
collected_geo_count = df_places_merged.loc[(df_places_merged['spatial_roleName'] == 'Place collected') & (df_places_merged['geo'].notnull())].shape[0]
print('{} of {} places ({:.2%}) with a "collected" relationship have coordinates'.format(collected_geo_count, collected_count, collected_geo_count / collected_count))
14512 of 14604 places (99.37%) with a "collected" relationship have coordinates
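If you're curious about the handful of 'collected' places that still have no coordinates after the merge, a quick sketch like this will list the most common ones:

# 'Collected' places that still have no geo value after merging in the place records
df_places_merged.loc[
    (df_places_merged['spatial_roleName'] == 'Place collected')
    & (df_places_merged['geo'].isnull())
]['spatial_title'].value_counts().head()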

Huzzah! Now we have geo-coordinates for almost all of the 'collected' places. Let's split the geo field into lats and lons as before.

In [27]:
df_places_merged[['lat', 'lon']] = df_places_merged['geo'].str.split(',', expand=True)

Objects by country

Now that we have a country_code column we can use it to filter our data. For example, let's look at places where objects were created in Australia.

First we'll filter our data by interactionType and country_code.

In [28]:
df_created_aus = df_places_merged.loc[(df_places_merged['spatial_interactionType'] == 'Production') & (df_places_merged['country_code'] == 'AU')]
In [31]:
df_created_aus[['spatial_title', 'lat', 'lon']].drop_duplicates()
Out[31]:
spatial_title lat lon
0 Ernabella, South Australia, Australia -26.2642 132.176
3 Canbbage Tree Island Public School, Cabbage Tr... -28.9842 153.457
11 Central Australia, Northern Territory, Australia -24.3617 133.735
26 Tasmania, Australia -41.9253 146.497
27 Melville Island, Tiwi Islands, Northern Territ... -11.55 130.93
... ... ... ...
59568 Sandy Blight Junction, Northern Territory, Aus... -23.1925 129.56
59582 Merigal, New South Wales, Australia -31.5025 148.262
59697 Hampden, South Australia, Australia -34.15 139.05
59830 Horn (Ngurupai) Island, Torres Strait, Queensl... -10.6069 142.29
60110 Morley, Perth, Western Australia, Australia -31.8872 115.907

789 rows × 3 columns

Now we can create a map. Note that we're changing the map layer in this chart to use just Australian boundaries, not the world.

In [32]:
# remove duplicate places
places = df_created_aus[['spatial_title', 'lat', 'lon']].drop_duplicates()

# Load Australian boundaries
australia = alt.topo_feature('https://raw.githubusercontent.com/GLAM-Workbench/trove-newspapers/master/data/aus_state.geojson', feature='features')

# Create the map of Australia using the boundaries
aus_background = alt.Chart(australia).mark_geoshape(
    
    # Style the map
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Plot the places
points = alt.Chart(places).mark_circle(
    
    # Style circle markers
    size=10,
    color='steelblue'
).encode(
    
    # Set position of each place using lat and lon
    longitude='lon:Q',
    latitude='lat:Q',
    
    # More details on hover
    tooltip=[alt.Tooltip('spatial_title', title='Place'), 'lat', 'lon']
).properties(width=700)

# Combine map and points
alt.layer(aus_background, points)
Out[32]:

Hmmm, what's with all that white space at the bottom? If you look closely, you'll see one blue dot right at the bottom of the chart. It's Commonwealth Bay in the Australian Antarctic Territory – technically part of Australia, but perhaps not what we expected. If we want our map centred on the Australian continent, we can filter out points with a latitude of less than -50.

Pandas is fussy about comparing different types of things, so let's make sure it knows that the lat field contains floats.

In [35]:
df_places_merged['lat'] = df_places_merged['lat'].astype('float')

Now we can filter the data.

In [36]:
df_created_aus = df_places_merged.loc[(df_places_merged['spatial_interactionType'] == 'Production') & (df_places_merged['country_code'] == 'AU') & (df_places_merged['lat'] > -50)]

And update our chart.

In [37]:
# Remove duplicate places
places = df_created_aus[['spatial_title', 'lat', 'lon']].drop_duplicates()

# Plot the places
points = alt.Chart(places).mark_circle(
    
    # Style circle markers
    size=10,
    color='steelblue'
).encode(
    
    # Set position of each place using lat and lon
    longitude='lon:Q',
    latitude='lat:Q',
    
    # More details on hover
    tooltip=[alt.Tooltip('spatial_title', title='Place'), 'lat', 'lon']
).properties(width=700)

# Combine map and points
alt.layer(aus_background, points)
Out[37]:

Created vs Collected – second try

Now that we have locations for the 'collected' records, we can put both 'created' and 'collected' places on a map.

To make things a bit easier, let's create a new column indicating whether the place relationship is 'collected' or 'created'.

In [38]:
def add_place_status(row):
    '''
    Determine relationship between object and place.
    '''
    if row['spatial_interactionType'] == 'Production':
        status = 'created'
    elif str(row['spatial_roleName']).lower() == 'place collected':
        status = 'collected'
    else:
        status = None
    return status

# Add a new column to the dataframe showing the relationship between place and object
df_places_merged['place_relation'] = df_places_merged.apply(add_place_status, axis=1)

We'll also filter out places without coordinates.

In [39]:
df_places_merged_with_geo = df_places_merged.loc[(df_places_merged['geo'].notnull()) & (df_places_merged['place_relation'].notnull())]

And remove duplicates, based on both spatial_id and the new place_relation field.

In [40]:
df_places_merged_with_geo = df_places_merged_with_geo.copy().drop_duplicates(subset=['spatial_id', 'place_relation'])
In [41]:
background = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(width=700)

# Plot the places
points = alt.Chart(df_places_merged_with_geo).mark_circle(
    size=10,
).encode(
    # Plot places by lat and lon
    longitude='lon:Q',
    latitude='lat:Q',
    
    # Details on hover
    tooltip=[alt.Tooltip('spatial_title', title='Place')],
    
    # Color will show whether 'collected' or 'created'
    color=alt.Color('place_relation:N', legend=alt.Legend(title='Relationship to place'))
).properties(width=700)

# Combine map and points
alt.layer(background, points)
Out[41]:
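If you want to explore the enriched place data outside of this notebook, you could save the merged dataframe to a CSV file. This is just a sketch, and the filename is arbitrary:

# Save the merged, enriched place data and display a download link
df_places_merged.to_csv('nma_places_merged.csv', index=False)
display(FileLink('nma_places_merged.csv'))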