Explore collection objects over time

In this notebook we'll explore the temporal dimensions of the object data. When were objects created, collected, or used? To do that we'll extract the nested temporal data, see what's there, and create a few charts.

See here for an introduction to the object data, and here to explore places associated with objects.

If you haven't already, you'll either need to harvest the object data, or unzip a pre-harvested dataset.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

Import what we need

In [5]:
import pandas as pd
from ipyleaflet import Map, Marker, Popup, MarkerCluster
import ipywidgets as widgets
from tinydb import TinyDB, Query
from pandas import json_normalize
import altair as alt
from IPython.display import display, HTML, FileLink

Load the harvested data

In [2]:
# Load JSON data from file
db = TinyDB('nma_object_db.json')
records = db.all()
Object = Query()
In [3]:
# Convert to a dataframe
df = pd.DataFrame(records)

Extract the nested events data

Events are linked to objects through the temporal field. This field contains nested data that we need to extract and flatten so we can work with it easily. We'll use json_normalize to extract the nested data and save each event to a new row.
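To see what json_normalize does with record_path and meta before we run it on the harvested data, here's a minimal illustration using a couple of made-up records (not the NMA data) with the same nested shape:

```python
import pandas as pd
from pandas import json_normalize

# Two made-up records with a nested 'temporal' list, mimicking the structure
records = [
    {'id': 1, 'title': 'Teapot', 'temporal': [
        {'title': '1901', 'startDate': '1901'},
        {'title': '1950', 'startDate': '1950'},
    ]},
    {'id': 2, 'title': 'Postcard', 'temporal': [
        {'title': 'June 1908', 'startDate': '1908-06'},
    ]},
]

# record_path explodes each nested event into its own row;
# meta copies the parent fields onto every exploded row
flat = json_normalize(records, record_path='temporal',
                      meta=['id', 'title'], record_prefix='temporal_')
print(flat)
```

Each of the three nested events ends up on its own row, with the parent object's id and title repeated alongside it.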

In [6]:
# Use json_normalize() to explode the temporal field into multiple rows and columns
# Then merge the exploded rows back with the original dataset using the id value
# df_dates = pd.merge(df.loc[df['temporal'].notnull()], json_normalize(df.loc[df['temporal'].notnull()].to_dict('records'), record_path='temporal', meta=['id'], record_prefix='temporal_'), how='inner', on='id')
df_dates = json_normalize(df.loc[df['temporal'].notnull()].to_dict('records'), record_path='temporal', meta=['id', 'title', 'additionalType'], record_prefix='temporal_')
df_dates.head()
Out[6]:
temporal_type temporal_title temporal_startDate temporal_endDate temporal_interactionType temporal_roleName temporal_description id title additionalType
0 Event 21 February 2009 2009-02-21 2009-02-21 Production NaN NaN 195843 Reproduction cartoon titled 'Better than the b... [Political cartoons]
1 Event June 1908 1908-06 1908-06 NaN Date of use NaN 31257 Kind Regards From Newtown [Postcards]
2 Event 26 January 1982 1982-01-26 1982-01-26 NaN Associated date NaN 135579 Protests during the campaign to save the Frank... [Photographs]
3 Event 1926 1926 1926 NaN Date acquired by donor by Australian Institute of Anatomy 6840 Spinning top [Centre of gravity toys]
4 Event 1872 1872 1872 NaN Associated date NaN 251967 Financial document from Tirranna Picnic Race C... [Financial records]

Now instead of having one row for each object, we have one row for each object event.

How many date records do we have?

In [7]:
df_dates.shape
Out[7]:
(39219, 10)

Exploring events

Let's extract years from the dates to make comparisons a bit easier.

In [8]:
# Use a regular expression to find the first four digits in the date fields
df_dates['start_year'] = df_dates['temporal_startDate'].str.extract(r'^(\d{4})').fillna(0).astype('int')
df_dates['end_year'] = df_dates['temporal_endDate'].str.extract(r'^(\d{4})').fillna(0).astype('int')

What's the earliest start_year (greater than 0)?

In [9]:
df_dates.loc[df_dates['start_year'] > 0]['start_year'].min()
Out[9]:
1001

What is it?

In [10]:
earliest = df_dates.loc[df_dates.loc[df_dates['start_year'] > 0]['start_year'].idxmin()]
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(earliest['id'], earliest['title'])))

What's the latest end date?

In [11]:
df_dates['end_year'].max()
Out[11]:
2992

Oh, that doesn't look quite right! Let's look to see how many of the dates are in the future!

In [12]:
df_dates.loc[(df_dates['start_year'] > 2019) | (df_dates['end_year'] > 2019)]
Out[12]:
temporal_type temporal_title temporal_startDate temporal_endDate temporal_interactionType temporal_roleName temporal_description id title additionalType start_year end_year
5787 Event 17 September 2082 2082-09-17 2082-09-17 Production NaN NaN 213266 Courtroom sketch 'NT Ranger, Mr. Roth.' by Ver... [Courtroom drawings] 2082 2082
6360 Event 7 January 2085 2085-01-07 2085-01-07 Production NaN NaN 195336 Woven basket with feathers and ochre [Baskets] 2085 2085
12505 Event 20 March 2085 2085-03-20 2085-03-20 Production NaN NaN 146492 Feathered stick with handle [Ornaments] 2085 2085
23828 Event 12 December 2992 2992-12-12 2992-12-12 NaN Associated date NaN 67099 Souvenir beaker - Princess Anne [Commemorative mugs] 2992 2992

Looks like these records need some editing.
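If you wanted to exclude these obviously wrong dates from later analysis, one option is to reset any year in the future to 0 (the value we already use for missing years). Here's a sketch on a toy dataframe with the same start_year and end_year columns:

```python
import pandas as pd

# Toy data standing in for df_dates -- same zero-filled year columns
df_dates = pd.DataFrame({
    'start_year': [1901, 2082, 1955],
    'end_year': [1901, 2082, 2992],
})

# Treat any year after 2019 (when this data was harvested) as unknown
for col in ['start_year', 'end_year']:
    df_dates.loc[df_dates[col] > 2019, col] = 0

print(df_dates)
```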

Types of events

Events are linked to objects in many different ways. They might document when the object was created, collected, or acquired by the museum. We can examine the types of relationships that have been documented between events and objects by looking in the temporal_roleName field.

In [13]:
df_dates['temporal_roleName'].value_counts()
Out[13]:
Date of publication       5022
Associated date           4015
Date made                 3950
Date of event             2995
Associated period         2973
Date collected            2503
Date of voyage            2477
Date photographed         1979
Period of use             1706
Date created              1473
Date of production        1030
Date of use                936
Date of issue              857
Date acquired by donor     544
Date acquired by NMA       451
Date written               422
Date of work               399
Date compiled              201
Date worn                  198
Date drawn                 162
Date of Event              139
Date acquired              128
Content created            120
Date posted                116
Date of purchase            78
Date awarded                76
Date printed                70
Date presented              69
Production date             58
Date designed               47
Date of death               24
Date painted                20
Date of restoration         18
Date of conversion          14
Date reprinted              12
Date of correspondence      10
Date of birth                9
Date built                   9
Date of patent               9
Date of Publication          7
date created                 6
date of publication          5
Date Acquired                4
Date of Production           4
date of production           2
Date of Correspondence       2
Date repographed             1
date of correspondence       1
Date reproduced              1
date made                    1
Period of Use                1
Date of Work                 1
Associated Period            1
date painted                 1
Name: temporal_roleName, dtype: int64

Hmmm, you can see that data entry into this field wasn't closely controlled – there are a number of minor variations in capitalisation, format and word order. For example, we have: 'Date of production', 'Date of Production', 'Production date', and 'date of production'!
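If you wanted to collapse some of these variants yourself, a rough first pass (sketched here on a toy series, not the full dataset) is to lowercase and strip the labels before counting:

```python
import pandas as pd

# A few of the variant labels noted above
roles = pd.Series(['Date of production', 'Date of Production',
                   'date of production', 'Production date', 'Date made'])

# Lowercasing merges the capitalisation variants; word-order variants
# like 'Production date' would still need a manual mapping
normalised = roles.str.lower().str.strip()
print(normalised.value_counts())
```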

Some normalisation has taken place though, because creation and production-related events can be identified through the temporal_interactionType field. What sorts of values does it contain?

In [14]:
df_dates['temporal_interactionType'].value_counts()
Out[14]:
Production    18012
Name: temporal_interactionType, dtype: int64

There's only one value – 'Production'. According to the documentation, a value of 'Production' in interactionType indicates the event was related to the creation of the item. Let's look to see which of the values in roleName have been grouped under the 'Production' value.

In [15]:
df_dates.loc[(df_dates['temporal_interactionType'] == 'Production')]['temporal_roleName'].value_counts()
Out[15]:
Date of publication       5016
Date made                 3950
Date photographed         1761
Date created              1473
Date of production        1030
Date of issue              674
Date written               422
Date of work               374
Date compiled              201
Date drawn                 162
Content created            120
Date posted                116
Date printed                70
Production date             58
Date designed               47
Date painted                20
Date of restoration         18
Date of conversion          14
Date reprinted              12
Date of correspondence      10
Date of patent               9
Date of Publication          7
date created                 6
date of publication          5
Date of Production           4
Date of Correspondence       2
date of production           2
date of correspondence       1
Date repographed             1
date made                    1
Date reproduced              1
Date of Work                 1
date painted                 1
Name: temporal_roleName, dtype: int64

So the temporal_interactionType field helps us find all the creation-related events without dealing with the variations in the ways event types are described. Yay for normalisation!

Creation dates

Let's create a dataframe that contains just the creation dates.

In [16]:
df_creation_dates = df_dates.loc[(df_dates['temporal_interactionType'] == 'Production')].copy()
In [17]:
df_creation_dates.shape
Out[17]:
(18012, 12)

One other thing to note is that not every event has a start date. Some just have an end date. To make sure we have at least one date for every event, let's create a new year column – we'll set its value to start_year if it exists, or end_year if not.

In [18]:
df_creation_dates['year'] = df_creation_dates.apply(lambda x: x['start_year'] if x['start_year'] else x['end_year'], axis=1)
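The row-wise apply() above is fine at this scale, but for larger datasets a vectorised equivalent using pandas' where() (same logic, assuming the same zero-filled year columns) would be faster:

```python
import pandas as pd

# Toy frame with the same zero-filled start_year / end_year convention
df_creation_dates = pd.DataFrame({
    'start_year': [1901, 0, 1955],
    'end_year': [1910, 1920, 1960],
})

# where() keeps start_year when it's non-zero, otherwise takes end_year
df_creation_dates['year'] = df_creation_dates['start_year'].where(
    df_creation_dates['start_year'] != 0, df_creation_dates['end_year'])
print(df_creation_dates['year'].tolist())
```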

Time to make a chart! Let's show how the creation events are distributed over time.

In [24]:
# First we'll get the number of objects per year
year_counts = df_creation_dates['year'].value_counts().to_frame().reset_index()
year_counts.columns = ['year', 'count']
In [28]:
# Create a bar chart (limit to years greater than 0)
alt.Chart(year_counts.loc[year_counts['year'] > 0]).mark_bar(size=2).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year of production')),
    
    # Number of objects on the Y axis
    y=alt.Y('count:Q', title='Number of objects'),
    
    # Show details on hover
    tooltip=[alt.Tooltip('year:Q', title='Year'), alt.Tooltip('count:Q', title='Objects', format=',')]
).properties(width=700)
Out[28]:

Ok, so something interesting was happening in 1980 and 1913. Let's see if we can find out what.

In another notebook I showed how you can use the additionalType column to find out about the types of things in the collection. Let's use it to see what types of objects were created in 1980.

Let's explode additionalType and create a new dataframe with the results!

In [29]:
df_creation_dates_types = df_creation_dates.loc[df_creation_dates['additionalType'].notnull()][['id', 'title', 'year', 'additionalType']].explode('additionalType')
df_creation_dates_types.head()
Out[29]:
id title year additionalType
0 195843 Reproduction cartoon titled 'Better than the b... 2009 Political cartoons
5 59924 Walka design from Ernabella 1954 Acrylic paintings
8 33064 Wonderland city, Sydney, 1908 1906 Photographic postcards
10 19877 Cylindrical hollow wood pipe with protruding bowl 1973 Smoking pipes
12 124027 Oak oil stone 1790 Sharpening stones

Now we can filter by year to see what types of things were created in 1980.

In [30]:
created_1980 = df_creation_dates_types.loc[df_creation_dates_types['year'] == 1980]
created_1980.head()
Out[30]:
id title year additionalType
16 166857 Spergularia media 1980 Mounts
28 166221 Centranthera cochinchinensis 1980 Engravings
36 167935 Carpha alpina var. schoenoides 1980 Mounts
60 166367 Persoonia levis 1980 Engravings
79 165539 Triumfetta repens 1980 Engravings

Let's look at the top twenty types of things created in 1980!

In [31]:
created_1980['additionalType'].value_counts()[:20]
Out[31]:
Engravings             1486
Mounts                  743
Folders                 100
Lists                    42
Notes                    36
Boxes                    35
Technical notes          34
Cartoons                  5
Paintings                 4
Placards                  3
Journals                  3
Storybooks                2
Advertising posters       2
Jugs                      2
Books                     2
Botanical drawings        2
Passes                    2
Textbooks                 2
Netballs                  1
Event posters             1
Name: additionalType, dtype: int64

So the vast majority are either 'Engravings' or 'Mounts'. Let's look at one of the 'Engravings' in more detail.

In [32]:
# Filter by Engravings
created_1980.loc[created_1980['additionalType'] == 'Engravings'].head()
Out[32]:
id title year additionalType
28 166221 Centranthera cochinchinensis 1980 Engravings
60 166367 Persoonia levis 1980 Engravings
79 165539 Triumfetta repens 1980 Engravings
155 167443 Hibiscus tiliaceus subsp. hastatus Malvaceae 1980 Engravings
195 167685 Lecanthus solandri 1980 Engravings
In [33]:
# Get the first item
item = created_1980.loc[created_1980['additionalType'] == 'Engravings'].iloc[0]

# Create a link to the collection db
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(item['id'], item['title'])))

If you follow the link you'll find that the engravings were created for a new publication of Banks' Florilegium.

Can you repeat this process to find out what happened in 1913?
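As a hint, the filter-and-count pattern is the same as for 1980, just with the year changed. Here it is sketched on a toy frame (the real df_creation_dates_types comes from the harvest above):

```python
import pandas as pd

# Made-up stand-in for df_creation_dates_types
df_creation_dates_types = pd.DataFrame({
    'year': [1913, 1913, 1913, 1980],
    'additionalType': ['Postcards', 'Postcards', 'Photographs', 'Engravings'],
})

# Filter to 1913 and count the object types, exactly as we did for 1980
created_1913 = df_creation_dates_types.loc[df_creation_dates_types['year'] == 1913]
print(created_1913['additionalType'].value_counts()[:20])
```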

Creation dates by object type

Now that we have a dataframe that combines creation dates with object types, we can look at how the creation of particular object types changes over time. For example let's look at 'Photographs' and 'Postcards'.

In [34]:
# Create a dataframe containing just Photographs and Postcards -- use .isin() to filter the additionalType field
df_photos_postcards = df_creation_dates_types.loc[(df_creation_dates_types['year'] > 0) & (df_creation_dates_types['additionalType'].isin(['Photographs', 'Postcards']))]

# Create a stacked bar chart
alt.Chart(df_photos_postcards).mark_bar(size=3).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year of production')),
    
    # Number of objects on the Y axis
    y=alt.Y('count()', title='Number of objects'),
    
    # Color according to the type
    color='additionalType:N',
    
    # Details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:Q', title='Year'), alt.Tooltip('count():Q', title='Objects', format=',')]
).properties(width=700)
Out[34]:

There's 1913 again... It's also interesting to see a shift from postcards to photos in the early decades of the 20th century.

We could add additional types to this chart, but it will get a bit confusing. Let's try another way of charting changes in the creation of the most common object types over time.

First we'll get the top twenty-five object types (which have creation dates) as a list.

In [35]:
# Get most common 25 values and convert to a list
top_types = df_creation_dates_types['additionalType'].value_counts()[:25].index.to_list()
top_types
Out[35]:
['Engravings',
 'Bark paintings',
 'Cartoons',
 'Negatives',
 'Mounts',
 'Photographs',
 'Paintings',
 'Prints',
 'Drawings',
 'Photographic postcards',
 'Acrylic paintings',
 'Letters',
 'Books',
 'Photographic slides',
 'Postcards',
 'Courtroom drawings',
 'Glass plate negatives',
 'Cards',
 'Botanical specimens',
 'Prize certificates',
 'Collecting cards',
 'Posters',
 'Sculptures',
 'Portrait photographs',
 'Telegrams']

Now we'll use the list of top_types to filter the creation dates, so we only have events relating to those types of objects.

In [36]:
# Only include records where the additionalType value is in the list of top_types
df_top_types = df_creation_dates_types.loc[(df_creation_dates_types['year'] > 0) & (df_creation_dates_types['additionalType'].isin(top_types))]
In [53]:
# Get the counts for year / type
top_type_counts = df_top_types.groupby('year')['additionalType'].value_counts().to_frame()
top_type_counts.columns = ['count']
top_type_counts.reset_index(inplace=True)

To chart this data we're going to use circles for each point and create 'bubble lines' for each object type to show how the number of objects created varied year by year.

In [52]:
# Create a chart
alt.Chart(top_type_counts).mark_circle(
    
    # Style the circles
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    
    # Year on the X axis
    x=alt.X('year:O', axis=alt.Axis(format='c', title='Year of production', labelAngle=0)),
    
    # Object type on the Y axis
    y=alt.Y('additionalType:N', title='Object type'),
    
    # Size of the circles represents the number of objects
    size=alt.Size('count:Q',
        scale=alt.Scale(range=[0, 2000]),
        legend=alt.Legend(title='Number of objects')
    ),
    
    # Color the circles by object type
    color=alt.Color('additionalType:N', legend=None),
    
    # More details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:O', title='Year'), alt.Tooltip('count:Q', title='Objects', format=',')]
).properties(
    width=700
)
Out[52]:

What patterns can you see? Hover over the circles for more information. Once again the engravings dominate, but also look at the bark paintings and cartoons. What might be happening there?


Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.