Explore collection objects over time

In this notebook we'll explore the temporal dimensions of the object data. When were objects created, collected, or used? To do that we'll extract the nested temporal data, see what's there, and create a few charts.

See here for an introduction to the object data, and here to explore places associated with objects.

If you haven't already, you'll either need to harvest the object data, or unzip a pre-harvested dataset.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

Import what we need

In [5]:
import pandas as pd
from ipyleaflet import Map, Marker, Popup, MarkerCluster
import ipywidgets as widgets
from tinydb import TinyDB, Query
from pandas import json_normalize
import altair as alt
from IPython.display import display, HTML, FileLink

Load the harvested data

In [2]:
# Load JSON data from file
db = TinyDB('nma_object_db.json')
records = db.all()
Object = Query()
In [3]:
# Convert to a dataframe
df = pd.DataFrame(records)

Extract the nested events data

Events are linked to objects through the temporal field. This field contains nested data that we need to extract and flatten so we can work with it easily. We'll use json_normalize to extract the nested data and save each event to a new row.
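To see what json_normalize does with record_path and meta before we run it on the harvested data, here's a minimal illustration using a couple of made-up records (not the NMA data) with the same nested shape:

```python
import pandas as pd
from pandas import json_normalize

# Two made-up records with a nested 'temporal' list, mimicking the structure
records = [
    {'id': 1, 'title': 'Teapot', 'temporal': [
        {'title': '1901', 'startDate': '1901'},
        {'title': '1950', 'startDate': '1950'},
    ]},
    {'id': 2, 'title': 'Postcard', 'temporal': [
        {'title': 'June 1908', 'startDate': '1908-06'},
    ]},
]

# record_path explodes each nested event into its own row;
# meta copies the parent fields onto every exploded row
flat = json_normalize(records, record_path='temporal',
                      meta=['id', 'title'], record_prefix='temporal_')
print(flat)
```

Each of the three nested events ends up on its own row, with the parent object's id and title repeated alongside it.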

In [6]:
# Use json_normalize() to explode the temporal field into multiple rows and columns
# Then merge the exploded rows back with the original dataset using the id value
# df_dates = pd.merge(df.loc[df['temporal'].notnull()], json_normalize(df.loc[df['temporal'].notnull()].to_dict('records'), record_path='temporal', meta=['id'], record_prefix='temporal_'), how='inner', on='id')
df_dates = json_normalize(df.loc[df['temporal'].notnull()].to_dict('records'), record_path='temporal', meta=['id', 'title', 'additionalType'], record_prefix='temporal_')
df_dates.head()
Out[6]:
temporal_type temporal_title temporal_startDate temporal_endDate temporal_interactionType temporal_roleName temporal_description id title additionalType
0 Event 21 February 2009 2009-02-21 2009-02-21 Production NaN NaN 195843 Reproduction cartoon titled 'Better than the b... [Political cartoons]
1 Event June 1908 1908-06 1908-06 NaN Date of use NaN 31257 Kind Regards From Newtown [Postcards]
2 Event 26 January 1982 1982-01-26 1982-01-26 NaN Associated date NaN 135579 Protests during the campaign to save the Frank... [Photographs]
3 Event 1926 1926 1926 NaN Date acquired by donor by Australian Institute of Anatomy 6840 Spinning top [Centre of gravity toys]
4 Event 1872 1872 1872 NaN Associated date NaN 251967 Financial document from Tirranna Picnic Race C... [Financial records]

Now instead of having one row for each object, we have one row for each object event.

How many date records do we have?

In [7]:
df_dates.shape
Out[7]:
(39219, 10)

Exploring events

Let's extract years from the dates to make comparisons a bit easier.

In [8]:
# Use a regular expression to find the first four digits in the date fields
df_dates['start_year'] = df_dates['temporal_startDate'].str.extract(r'^(\d{4})').fillna(0).astype('int')
df_dates['end_year'] = df_dates['temporal_endDate'].str.extract(r'^(\d{4})').fillna(0).astype('int')

What's the earliest start_year (greater than 0)?

In [9]:
df_dates.loc[df_dates['start_year'] > 0]['start_year'].min()
Out[9]:
1001

What is it?

In [10]:
earliest = df_dates.loc[df_dates.loc[df_dates['start_year'] > 0]['start_year'].idxmin()]
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(earliest['id'], earliest['title'])))

What's the latest end date?

In [11]:
df_dates['end_year'].max()
Out[11]:
2992

Oh, that doesn't look quite right! Let's look to see how many of the dates are in the future!

In [12]:
df_dates.loc[(df_dates['start_year'] > 2019) | (df_dates['end_year'] > 2019)]
Out[12]:
temporal_type temporal_title temporal_startDate temporal_endDate temporal_interactionType temporal_roleName temporal_description id title additionalType start_year end_year
5787 Event 17 September 2082 2082-09-17 2082-09-17 Production NaN NaN 213266 Courtroom sketch 'NT Ranger, Mr. Roth.' by Ver... [Courtroom drawings] 2082 2082
6360 Event 7 January 2085 2085-01-07 2085-01-07 Production NaN NaN 195336 Woven basket with feathers and ochre [Baskets] 2085 2085
12505 Event 20 March 2085 2085-03-20 2085-03-20 Production NaN NaN 146492 Feathered stick with handle [Ornaments] 2085 2085
23828 Event 12 December 2992 2992-12-12 2992-12-12 NaN Associated date NaN 67099 Souvenir beaker - Princess Anne [Commemorative mugs] 2992 2992

Looks like these records need some editing.
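If you wanted to exclude these obviously wrong dates from later analysis, one option is to reset any year in the future to 0 (the value we already use for missing years). Here's a sketch on a toy dataframe with the same start_year and end_year columns:

```python
import pandas as pd

# Toy data standing in for df_dates -- same zero-filled year columns
df_dates = pd.DataFrame({
    'start_year': [1901, 2082, 1955],
    'end_year': [1901, 2082, 2992],
})

# Treat any year after 2019 (when this data was harvested) as unknown
for col in ['start_year', 'end_year']:
    df_dates.loc[df_dates[col] > 2019, col] = 0

print(df_dates)
```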

Types of events

Events are linked to objects in many different ways. They might document when the object was created, collected, or acquired by the museum. We can examine the types of relationships that have been documented between events and objects by looking in the temporal_roleName field.

In [13]:
df_dates['temporal_roleName'].value_counts()
Out[13]:
Date of publication       5022
Associated date           4015
Date made                 3950
Date of event             2995
Associated period         2973
Date collected            2503
Date of voyage            2477
Date photographed         1979
Period of use             1706
Date created              1473
Date of production        1030
Date of use                936
Date of issue              857
Date acquired by donor     544
Date acquired by NMA       451
Date written               422
Date of work               399
Date compiled              201
Date worn                  198
Date drawn                 162
Date of Event              139
Date acquired              128
Content created            120
Date posted                116
Date of purchase            78
Date awarded                76
Date printed                70
Date presented              69
Production date             58
Date designed               47
Date of death               24
Date painted                20
Date of restoration         18
Date of conversion          14
Date reprinted              12
Date of correspondence      10
Date of birth                9
Date built                   9
Date of patent               9
Date of Publication          7
date created                 6
date of publication          5
Date Acquired                4
Date of Production           4
date of production           2
Date of Correspondence       2
Date repographed             1
date of correspondence       1
Date reproduced              1
date made                    1
Period of Use                1
Date of Work                 1
Associated Period            1
date painted                 1
Name: temporal_roleName, dtype: int64

Hmmm, you can see that data entry into this field wasn't closely controlled – there are a number of minor variations in capitalisation, format and word order. For example, we have: 'Date of production', 'Date of Production', 'Production date', and 'date of production'!
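If you wanted to collapse some of these variants yourself, a rough first pass (sketched here on a toy series, not the full dataset) is to lowercase and strip the labels before counting:

```python
import pandas as pd

# A few of the variant labels noted above
roles = pd.Series(['Date of production', 'Date of Production',
                   'date of production', 'Production date', 'Date made'])

# Lowercasing merges the capitalisation variants; word-order variants
# like 'Production date' would still need a manual mapping
normalised = roles.str.lower().str.strip()
print(normalised.value_counts())
```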

Some normalisation has taken place though, because creation and production-related events can be identified through the temporal_interactionType field. What sorts of values does it contain?

In [14]:
df_dates['temporal_interactionType'].value_counts()
Out[14]:
Production    18012
Name: temporal_interactionType, dtype: int64

There's only one value – 'Production'. According to the documentation, a value of 'Production' in interactionType indicates the event was related to the creation of the item. Let's look to see which of the values in roleName have been grouped under the 'Production' value.

In [15]:
df_dates.loc[(df_dates['temporal_interactionType'] == 'Production')]['temporal_roleName'].value_counts()
Out[15]:
Date of publication       5016
Date made                 3950
Date photographed         1761
Date created              1473
Date of production        1030
Date of issue              674
Date written               422
Date of work               374
Date compiled              201
Date drawn                 162
Content created            120
Date posted                116
Date printed                70
Production date             58
Date designed               47
Date painted                20
Date of restoration         18
Date of conversion          14
Date reprinted              12
Date of correspondence      10
Date of patent               9
Date of Publication          7
date created                 6
date of publication          5
Date of Production           4
Date of Correspondence       2
date of production           2
date of correspondence       1
Date repographed             1
date made                    1
Date reproduced              1
Date of Work                 1
date painted                 1
Name: temporal_roleName, dtype: int64

So the temporal_interactionType field helps us find all the creation-related events without dealing with the variations in the ways event types are described. Yay for normalisation!

Creation dates

Let's create a dataframe that contains just the creation dates.

In [16]:
df_creation_dates = df_dates.loc[(df_dates['temporal_interactionType'] == 'Production')].copy()
In [17]:
df_creation_dates.shape
Out[17]:
(18012, 12)

One other thing to note is that not every event has a start date. Some just have an end date. To make sure we have at least one date for every event, let's create a new year column – we'll set its value to start_year if it exists, or end_year if not.

In [18]:
df_creation_dates['year'] = df_creation_dates.apply(lambda x: x['start_year'] if x['start_year'] else x['end_year'], axis=1)
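The row-wise apply() above is fine at this scale, but for larger datasets a vectorised equivalent using pandas' where() (same logic, assuming the same zero-filled year columns) would be faster:

```python
import pandas as pd

# Toy frame with the same zero-filled start_year / end_year convention
df_creation_dates = pd.DataFrame({
    'start_year': [1901, 0, 1955],
    'end_year': [1910, 1920, 1960],
})

# where() keeps start_year when it's non-zero, otherwise takes end_year
df_creation_dates['year'] = df_creation_dates['start_year'].where(
    df_creation_dates['start_year'] != 0, df_creation_dates['end_year'])
print(df_creation_dates['year'].tolist())
```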

Time to make a chart! Let's show how the creation events are distributed over time.

In [24]:
# First we'll get the number of objects per year
year_counts = df_creation_dates['year'].value_counts().to_frame().reset_index()
year_counts.columns = ['year', 'count']
In [28]:
# Create a bar chart (limit to years greater than 0)
alt.Chart(year_counts.loc[year_counts['year'] > 0]).mark_bar(size=2).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year of production')),
    
    # Number of objects on the Y axis
    y=alt.Y('count:Q', title='Number of objects'),
    
    # Show details on hover
    tooltip=[alt.Tooltip('year:Q', title='Year'), alt.Tooltip('count:Q', title='Objects', format=',')]
).properties(width=700)
Out[28]:

Ok, so something interesting was happening in 1980 and 1913. Let's see if we can find out what.

In another notebook I showed how you can use the additionalType column to find out about the types of things in the collection. Let's use it to see what types of objects were created in 1980.

Let's explode additionalType and create a new dataframe with the results!

In [29]:
df_creation_dates_types = df_creation_dates.loc[df_creation_dates['additionalType'].notnull()][['id', 'title', 'year', 'additionalType']].explode('additionalType')
df_creation_dates_types.head()
Out[29]:
id title year additionalType
0 195843 Reproduction cartoon titled 'Better than the b... 2009 Political cartoons
5 59924 Walka design from Ernabella 1954 Acrylic paintings
8 33064 Wonderland city, Sydney, 1908 1906 Photographic postcards
10 19877 Cylindrical hollow wood pipe with protruding bowl 1973 Smoking pipes
12 124027 Oak oil stone 1790 Sharpening stones

Now we can filter by year to see what types of things were created in 1980.

In [30]:
created_1980 = df_creation_dates_types.loc[df_creation_dates_types['year'] == 1980]
created_1980.head()
Out[30]:
id title year additionalType
16 166857 Spergularia media 1980 Mounts
28 166221 Centranthera cochinchinensis 1980 Engravings
36 167935 Carpha alpina var. schoenoides 1980 Mounts
60 166367 Persoonia levis 1980 Engravings
79 165539 Triumfetta repens 1980 Engravings

Let's look at the top twenty types of things created in 1980!

In [31]:
created_1980['additionalType'].value_counts()[:20]
Out[31]:
Engravings             1486
Mounts                  743
Folders                 100
Lists                    42
Notes                    36
Boxes                    35
Technical notes          34
Cartoons                  5
Paintings                 4
Placards                  3
Journals                  3
Storybooks                2
Advertising posters       2
Jugs                      2
Books                     2
Botanical drawings        2
Passes                    2
Textbooks                 2
Netballs                  1
Event posters             1
Name: additionalType, dtype: int64

So the vast majority are either 'Engravings' or 'Mounts'. Let's look at one of the 'Engravings' in more detail.

In [32]:
# Filter by Engravings
created_1980.loc[created_1980['additionalType'] == 'Engravings'].head()
Out[32]:
id title year additionalType
28 166221 Centranthera cochinchinensis 1980 Engravings
60 166367 Persoonia levis 1980 Engravings
79 165539 Triumfetta repens 1980 Engravings
155 167443 Hibiscus tiliaceus subsp. hastatus Malvaceae 1980 Engravings
195 167685 Lecanthus solandri 1980 Engravings
In [33]:
# Get the first item
item = created_1980.loc[created_1980['additionalType'] == 'Engravings'].iloc[0]

# Create a link to the collection db
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(item['id'], item['title'])))

If you follow the link you'll find that the engravings were created for a new publication of Banks' Florilegium.

Can you repeat this process to find out what happened in 1913?
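As a hint, the filter-and-count pattern is the same as for 1980, just with the year changed. Here it is sketched on a toy frame (the real df_creation_dates_types comes from the harvest above):

```python
import pandas as pd

# Made-up stand-in for df_creation_dates_types
df_creation_dates_types = pd.DataFrame({
    'year': [1913, 1913, 1913, 1980],
    'additionalType': ['Postcards', 'Postcards', 'Photographs', 'Engravings'],
})

# Filter to 1913 and count the object types, exactly as we did for 1980
created_1913 = df_creation_dates_types.loc[df_creation_dates_types['year'] == 1913]
print(created_1913['additionalType'].value_counts()[:20])
```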

Creation dates by object type

Now that we have a dataframe that combines creation dates with object types, we can look at how the creation of particular object types changes over time. For example let's look at 'Photographs' and 'Postcards'.

In [34]:
# Create a dataframe containing just Photographs and Postcards -- use .isin() to filter the additionalType field
df_photos_postcards = df_creation_dates_types.loc[(df_creation_dates_types['year'] > 0) & (df_creation_dates_types['additionalType'].isin(['Photographs', 'Postcards']))]

# Create a stacked bar chart
alt.Chart(df_photos_postcards).mark_bar(size=3).encode(
    
    # Year on the X axis
    x=alt.X('year:Q', axis=alt.Axis(format='c', title='Year of production')),
    
    # Number of objects on the Y axis
    y=alt.Y('count()', title='Number of objects'),
    
    # Color according to the type
    color='additionalType:N',
    
    # Details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:Q', title='Year'), alt.Tooltip('count():Q', title='Objects', format=',')]
).properties(width=700)
Out[34]:

There's 1913 again... It's also interesting to see a shift from postcards to photos in the early decades of the 20th century.

We could add additional types to this chart, but it will get a bit confusing. Let's try another way of charting changes in the creation of the most common object types over time.

First we'll get the top twenty-five object types (which have creation dates) as a list.

In [35]:
# Get most common 25 values and convert to a list
top_types = df_creation_dates_types['additionalType'].value_counts()[:25].index.to_list()
top_types
Out[35]:
['Engravings',
 'Bark paintings',
 'Cartoons',
 'Negatives',
 'Mounts',
 'Photographs',
 'Paintings',
 'Prints',
 'Drawings',
 'Photographic postcards',
 'Acrylic paintings',
 'Letters',
 'Books',
 'Photographic slides',
 'Postcards',
 'Courtroom drawings',
 'Glass plate negatives',
 'Cards',
 'Botanical specimens',
 'Prize certificates',
 'Collecting cards',
 'Posters',
 'Sculptures',
 'Portrait photographs',
 'Telegrams']

Now we'll use the list of top_types to filter the creation dates, so we only have events relating to those types of objects.

In [36]:
# Only include records where the additionalType value is in the list of top_types
df_top_types = df_creation_dates_types.loc[(df_creation_dates_types['year'] > 0) & (df_creation_dates_types['additionalType'].isin(top_types))]
In [53]:
# Get the counts for year / type
top_type_counts = df_top_types.groupby('year')['additionalType'].value_counts().to_frame()
top_type_counts.columns = ['count']
top_type_counts.reset_index(inplace=True)

To chart this data we're going to use circles for each point and create 'bubble lines' for each object type to show how the number of objects created varied year by year.

In [52]:
# Create a chart
alt.Chart(top_type_counts).mark_circle(
    
    # Style the circles
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    
    # Year on the X axis
    x=alt.X('year:O', axis=alt.Axis(format='c', title='Year of production', labelAngle=0)),
    
    # Object type on the Y axis
    y=alt.Y('additionalType:N', title='Object type'),
    
    # Size of the circles represents the number of objects
    size=alt.Size('count:Q',
        scale=alt.Scale(range=[0, 2000]),
        legend=alt.Legend(title='Number of objects')
    ),
    
    # Color the circles by object type
    color=alt.Color('additionalType:N', legend=None),
    
    # More details on hover
    tooltip=[alt.Tooltip('additionalType:N', title='Type'), alt.Tooltip('year:O', title='Year'), alt.Tooltip('count:Q', title='Objects', format=',')]
).properties(
    width=700
)
Out[52]:

What patterns can you see? Hover over the circles for more information. Once again the engravings dominate, but also look at the bark paintings and cartoons. What might be happening there?


Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.