Exploring object records

In this notebook we'll have a preliminary poke around in the object data harvested from the NMA Collection API. I'll focus here on the basic shape and stats of the data; other notebooks explore the object data over time and space.

If you haven't already, you'll either need to harvest the object data, or unzip a pre-harvested dataset.

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Is this thing on? If you can't edit or run any of the code cells, you might be viewing a static (read only) version of this notebook. Click here to load a live version running on Binder.

Import what we need

In [23]:
import pandas as pd
import math
from IPython.display import display, HTML, FileLink
from tinydb import TinyDB, Query
from pandas import json_normalize

Load the harvested data

In [2]:
# Load the harvested data from the json db
db = TinyDB('nma_object_db.json')
records = db.all()
Object = Query()
In [3]:
# Convert to a dataframe
df = pd.DataFrame(records)
df.head()
Out[3]:
id type title _meta additionalType collection identifier medium extent physicalDescription ... isPartOf seeAlso description hasVersion temporal relation hasPart educationalSignificance location acknowledgement
0 145400 object Wahlo and Tribal law by Kevin Gilbert, reprint... {'modified': '2018-07-09', 'issued': '2011-10-... NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 251390 object Pair of woven shoes made from feathers and hair {'modified': '2019-01-17', 'issued': '2018-04-... [Shoes] {'id': '5244', 'type': 'Collection', 'title': ... 2000.0014.0495 [{'type': 'Material', 'title': 'Feather'}, {'t... {'type': 'Measurement', 'length': 260, 'width'... Shoes, the soles of which are made from woven ... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 124081 object Pair of ceremonial shoes {'modified': '2018-12-04', 'issued': '2006-10-... NaN {'id': '1892', 'type': 'Collection', 'title': ... 1992.0089.0165 [{'type': 'Material', 'title': 'Feather'}] {'type': 'Measurement', 'length': 246, 'width'... A pair of ceremonial shoes made with several m... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 21507 object Grinding stone {'modified': '2018-06-19', 'issued': '2014-12-... [Grinding stones] {'id': '2229', 'type': 'Collection', 'title': ... 1985.0288.0109 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 142308 object 'time CHange' [sic] {'modified': '2019-04-15', 'issued': '2012-06-... [Compact discs] {'id': '3893', 'type': 'Collection', 'title': ... AR00213.012 NaN NaN A compact disc, housed within a clear and blac... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 25 columns

The shape of the data

How many objects are there?

In [4]:
print('There are {:,} objects in the collection'.format(df.shape[0]))
There are 86,717 objects in the collection

Obviously not every record has a value for every field. Let's create a quick count of the number of non-empty values in each field.

In [5]:
df.count()
Out[5]:
id                         86717
type                       86717
title                      86558
_meta                      86717
additionalType             86690
collection                 84289
identifier                 86692
medium                     73952
extent                     64199
physicalDescription        86397
significanceStatement      32468
creator                    25119
spatial                    46773
contributor                40760
isAggregatedBy              4353
isPartOf                   10769
seeAlso                      467
description                 9128
hasVersion                 20159
temporal                   29597
relation                    3096
hasPart                     2350
educationalSignificance      201
location                    1069
acknowledgement              789
dtype: int64

Let's express those counts as a percentage of the total number of records, and display them as a bar chart using Pandas.

In [6]:
# Get field counts and convert to dataframe
field_counts = df.count().to_frame().reset_index()

# Change column headings
field_counts.columns = ['field', 'count']

# Calculate proportion of the total
field_counts['proportion'] = field_counts['count'].apply(lambda x: x / df.shape[0])

# Style the results as a barchart
field_counts.style.bar(subset=['proportion'], color='#d65f5f').format({'proportion': '{:.2%}'.format})
Out[6]:
field count proportion
0 id 86717 100.00%
1 type 86717 100.00%
2 title 86558 99.82%
3 _meta 86717 100.00%
4 additionalType 86690 99.97%
5 collection 84289 97.20%
6 identifier 86692 99.97%
7 medium 73952 85.28%
8 extent 64199 74.03%
9 physicalDescription 86397 99.63%
10 significanceStatement 32468 37.44%
11 creator 25119 28.97%
12 spatial 46773 53.94%
13 contributor 40760 47.00%
14 isAggregatedBy 4353 5.02%
15 isPartOf 10769 12.42%
16 seeAlso 467 0.54%
17 description 9128 10.53%
18 hasVersion 20159 23.25%
19 temporal 29597 34.13%
20 relation 3096 3.57%
21 hasPart 2350 2.71%
22 educationalSignificance 201 0.23%
23 location 1069 1.23%
24 acknowledgement 789 0.91%

Nested data

One thing you might notice is that some of the fields contain nested JSON arrays or objects. For example, additionalType contains a list of object types, while extent is a dictionary with keys and values. Let's unpack these columns for the second row (index of 1).

In [7]:
df['additionalType'][1][0]
Out[7]:
'Shoes'
In [8]:
df['extent'][1]
Out[8]:
{'type': 'Measurement',
 'length': 260,
 'width': 120,
 'depth': 40,
 'unitText': 'mm'}
In [9]:
df['extent'][1]['length']
Out[9]:
260
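
Indexing straight into a nested value works for row 1, but rows with no additionalType or extent hold NaN and would raise an error. Here's a minimal sketch of a more defensive approach, using a tiny made-up dataframe rather than the harvested data. The helper names first_type and extent_value are my own inventions for illustration.

```python
import pandas as pd

# Toy records that mimic the shape of the harvested data (values are illustrative)
toy = pd.DataFrame([
    {'id': '1', 'additionalType': ['Shoes'],
     'extent': {'type': 'Measurement', 'length': 260, 'unitText': 'mm'}},
    {'id': '2'},  # no additionalType or extent, so pandas fills these with NaN
])

def first_type(value):
    '''Return the first item of a list, or None if the value is missing or empty.'''
    if isinstance(value, list) and value:
        return value[0]
    return None

def extent_value(value, key):
    '''Return value[key] if value is a dict, otherwise None.'''
    if isinstance(value, dict):
        return value.get(key)
    return None

first_types = toy['additionalType'].apply(first_type)
lengths = toy['extent'].apply(lambda value: extent_value(value, 'length'))
print(first_types.tolist())
print(lengths.tolist())
```

The same helpers could be applied to df['additionalType'] and df['extent'], assuming the columns only ever hold lists, dictionaries, or NaN.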

The additionalType field

How many objects have values in the additionalType column?

In [10]:
df.loc[df['additionalType'].notnull()].shape
Out[10]:
(86690, 25)
In [11]:
print('{:%} of objects have an additionalType value'.format(df.loc[df['additionalType'].notnull()].shape[0] / df.shape[0]))
99.968864% of objects have an additionalType value

So which ones don't have an additionalType?

In [12]:
# Just show the first 5 rows
df.loc[df['additionalType'].isnull()].head()
Out[12]:
id type title _meta additionalType collection identifier medium extent physicalDescription ... isPartOf seeAlso description hasVersion temporal relation hasPart educationalSignificance location acknowledgement
0 145400 object Wahlo and Tribal law by Kevin Gilbert, reprint... {'modified': '2018-07-09', 'issued': '2011-10-... NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 124081 object Pair of ceremonial shoes {'modified': '2018-12-04', 'issued': '2006-10-... NaN {'id': '1892', 'type': 'Collection', 'title': ... 1992.0089.0165 [{'type': 'Material', 'title': 'Feather'}] {'type': 'Measurement', 'length': 246, 'width'... A pair of ceremonial shoes made with several m... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1728 180161 object Awelye- panel 1 by Lily Kngwarreye {'copyright': '', 'licence': ''} NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1939 224632 object Glass plate negative of family and horse stand... {'copyright': '', 'licence': ''} NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3416 180165 object Awelye- panel 3 by Lily Kngwarreye {'copyright': '', 'licence': ''} NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 25 columns

How many rows have more than one additionalType?

In [13]:
df.loc[df['additionalType'].str.len() > 1].shape[0]
Out[13]:
1038
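
It might not be obvious why .str.len() works here, given that the column holds lists rather than strings. As far as I understand it, the .str accessor simply takes the length of each element, so lists are counted item by item and missing values come back as NaN. A small sketch on made-up values:

```python
import pandas as pd

# Illustrative values only: two lists and one missing value
types = pd.Series([['Albums', 'Newspaper clippings'], ['Shoes'], None])

# Length of each list; the missing value becomes NaN
lengths = types.str.len()
print(lengths.tolist())

# NaN > 1 is False, so rows without a value are dropped by the filter
multi = types[lengths > 1]
print(multi.tolist())
```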

Let's have a look at a sample.

In [14]:
df.loc[df['additionalType'].str.len() > 1].head()
Out[14]:
id type title _meta additionalType collection identifier medium extent physicalDescription ... isPartOf seeAlso description hasVersion temporal relation hasPart educationalSignificance location acknowledgement
45 202601 object Album of Newspaper clippings {'modified': '2019-04-22', 'issued': '2010-11-... [Albums, Newspaper clippings] {'id': '4760', 'type': 'Collection', 'title': ... 1989.0009.0108 [{'type': 'Material', 'title': 'Cardboard'}, {... {'type': 'Measurement', 'height': 345, 'width'... A brown textured hardback album with gold colo... ... NaN NaN NaN NaN [{'type': 'Event', 'title': '1935', 'startDate... NaN NaN NaN NaN NaN
113 256766 object Handmade wolf figurine in yellow dress likely ... {'modified': '2018-12-13', 'issued': '2018-10-... [Novelty toys, Toys] {'id': '6773', 'type': 'Collection', 'title': ... 2013.0038.0556.005 [{'type': 'Material', 'title': 'Cotton thread'... {'type': 'Measurement', 'height': 88, 'width':... A handmade wolf figurine robed in a yellow dre... ... NaN NaN NaN NaN [{'type': 'Event', 'title': '1925 - 1935', 'st... NaN NaN NaN NaN NaN
133 223557 object Receipt issued to Tirranna Race Club, 1878 {'modified': '2019-04-23', 'issued': '2017-11-... [Invoices, Receipts] {'id': '6139', 'type': 'Collection', 'title': ... 2012.0019.0170 [{'type': 'Material', 'title': 'Ink'}, {'type'... {'type': 'Measurement', 'height': 114, 'width'... A receipt handwritten on a piece of grey paper... ... NaN NaN NaN NaN [{'type': 'Event', 'title': '1878', 'startDate... NaN NaN NaN NaN NaN
219 231018 object Cycling jersey worn by Harry Clarke {'modified': '2019-04-12', 'issued': '2017-03-... [Gee clamps, Sports clothing] {'id': '7017', 'type': 'Collection', 'title': ... 2013.0033.0002 [{'type': 'Material', 'title': 'Polyester clot... {'type': 'Measurement', 'height': 610, 'width'... A short sleeved, striped brown, black and tan ... ... NaN NaN Brown and yellow cycling jersey worn by Harry ... [{'id': '131401', 'type': 'StillImage', 'ident... [{'type': 'Event', 'title': '1988', 'startDate... NaN NaN NaN NaN NaN
301 255447 object Pair of orange leather dolls shoes with pom pom {'modified': '2019-04-24', 'issued': '2018-06-... [Dolls clothing, Shoes] {'id': '6773', 'type': 'Collection', 'title': ... 2013.0038.0315 [{'type': 'Material', 'title': 'Cotton thread'... {'type': 'Measurement', 'height': 25, 'width':... A pair of orange leather dolls shoes with one ... ... NaN NaN NaN NaN [{'type': 'Event', 'title': '1925 - 1935', 'st... NaN NaN NaN NaN NaN

5 rows × 25 columns

The additionalType field contains a nested list of values. Using json_normalize() or explode() we can explode these lists, creating a row for each separate value.

In [15]:
# Use json_normalize to expand 'additionalType' into separate rows, adding the id and title from the parent record
# df_types = json_normalize(df.loc[df['additionalType'].notnull()].to_dict('records'), record_path='additionalType', meta=['id', 'title'], errors='ignore').rename({0: 'additionalType'}, axis=1)

# In pandas v.0.25 and above you can just use explode -- this produces the same result as above
df_types = df.loc[df['additionalType'].notnull()][['id', 'title', 'additionalType']].explode('additionalType')

df_types.head()
Out[15]:
id title additionalType
1 251390 Pair of woven shoes made from feathers and hair Shoes
3 21507 Grinding stone Grinding stones
4 142308 'time CHange' [sic] Compact discs
5 20174 Ten Days To Live - A supposed sorcery painting. Bark paintings
6 144359 'The Dance of Life (1898-1902)' by Diana Boyer... Booklets
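
To see what explode() is doing, it can help to try it on a few made-up rows. This is just an illustrative sketch, not the harvested data. Note that the exploded rows keep the index label of the record they came from, so labels can repeat.

```python
import pandas as pd

# Made-up records: one with two types, one with a single type, one with none
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'additionalType': [['Albums', 'Newspaper clippings'], ['Shoes'], None],
})

# Drop records without a type, then create one row per list item
exploded = toy.dropna(subset=['additionalType']).explode('additionalType')
print(exploded)
```

Record 'a' now appears twice (both rows labelled 0), which is worth remembering if you later join the exploded rows back to other data.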

Now that we've exploded the type values, we can aggregate them in different ways. Let's look at the 25 most common object types!

In [16]:
df_types['additionalType'].value_counts()[:25]
Out[16]:
Mineral samples                   6000
Photographs                       4742
Stone artefacts                   4364
Photographic postcards            4250
Drawings                          3755
Postcards                         3697
Zoological specimens              2168
Bark paintings                    2107
Geological specimens              1993
Engravings                        1498
Cartoons                          1384
Negatives                         1124
Boomerangs                        1025
Spears                            1012
Percussion and abrading stones     982
Paintings                          840
Clubs                              747
Mounts                             745
Cards                              709
Armbands                           649
Shells                             563
Letters                            543
Documents                          519
Geophysical survey equipment       509
Posters                            497
Name: additionalType, dtype: int64

How many object types only appear once?

In [17]:
type_counts = df_types['additionalType'].value_counts().to_frame().reset_index().rename({'index': 'type', 'additionalType': 'count'}, axis=1)
unique_types = type_counts.loc[type_counts['count'] == 1]
unique_types.shape[0]
Out[17]:
639
In [18]:
unique_types.head()
Out[18]:
type count
1854 Medications 1
1855 Hollow bits 1
1856 Television cameras 1
1857 Art drawings 1
1858 Electric indicators 1
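
One caveat: the rename() call above relies on the column names that value_counts().reset_index() produced in the version of pandas used for this notebook. From pandas 2.0 those names changed, so the rename may silently do nothing. Here's a sketch of a version-independent alternative on made-up values, which names the columns explicitly instead.

```python
import pandas as pd

# Made-up list of exploded type values
types = pd.Series(['Shoes', 'Toys', 'Shoes', 'Vegetables'], name='additionalType')

# Name the index and the counts explicitly rather than renaming afterwards
type_counts = (
    types.value_counts()
    .rename_axis('type')
    .reset_index(name='count')
)

# Types that appear exactly once
singletons = type_counts.loc[type_counts['count'] == 1]
print(type_counts)
```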

Let's save the complete list of types as a CSV file.

In [19]:
type_counts.to_csv('nma_object_type_counts.csv', index=False)
display(FileLink('nma_object_type_counts.csv'))

Browsing the CSV, I noticed that there was one item with the type Vegetables. Let's find out more about it.

In [20]:
# Find in the complete data set
mask = df.loc[df['additionalType'].notnull()]['additionalType'].apply(lambda x: 'Vegetables' in x)
veggie = df.loc[df['additionalType'].notnull()][mask]
veggie
Out[20]:
id type title _meta additionalType collection identifier medium extent physicalDescription ... isPartOf seeAlso description hasVersion temporal relation hasPart educationalSignificance location acknowledgement
21559 256742 object Wooden toy toad stalk {'modified': '2019-04-24', 'issued': '2018-10-... [Toys, Vegetables] {'id': '6773', 'type': 'Collection', 'title': ... 2013.0038.0540 [{'type': 'Material', 'title': 'Paint - non sp... {'type': 'Measurement', 'height': 65, 'diamete... A painted wooden toy toad stalk with a red cap... ... NaN NaN NaN NaN [{'type': 'Event', 'title': '1925 - 1935', 'st... NaN NaN NaN NaN NaN

1 rows × 25 columns

We can create a link into the NMA Collections Explorer using the object id.

In [21]:
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(veggie.iloc[0]['id'], veggie.iloc[0]['title'])))

Does a toadstool count as a vegetable?

The extent field

The extent field is a nested object, so this time we'll use json_normalize() to expand it out into separate columns.

In [24]:
# Without reset_index() the rows are misaligned
df_extent = df.loc[df['extent'].notnull()].reset_index().join(json_normalize(df.loc[df['extent'].notnull()]['extent'].tolist()).add_prefix("extent_"))
df_extent.head()
Out[24]:
index id type title _meta additionalType collection identifier medium extent ... acknowledgement extent_type extent_length extent_width extent_depth extent_unitText extent_height extent_diameter extent_weight extent_unitTextWeight
0 1 251390 object Pair of woven shoes made from feathers and hair {'modified': '2019-01-17', 'issued': '2018-04-... [Shoes] {'id': '5244', 'type': 'Collection', 'title': ... 2000.0014.0495 [{'type': 'Material', 'title': 'Feather'}, {'t... {'type': 'Measurement', 'length': 260, 'width'... ... NaN Measurement 260.0 120.0 40.0 mm NaN NaN NaN NaN
1 2 124081 object Pair of ceremonial shoes {'modified': '2018-12-04', 'issued': '2006-10-... NaN {'id': '1892', 'type': 'Collection', 'title': ... 1992.0089.0165 [{'type': 'Material', 'title': 'Feather'}] {'type': 'Measurement', 'length': 246, 'width'... ... NaN Measurement 246.0 190.0 45.0 mm NaN NaN NaN NaN
2 5 20174 object Ten Days To Live - A supposed sorcery painting. {'modified': '2019-04-21', 'issued': '2013-06-... [Bark paintings] {'id': '2202', 'type': 'Collection', 'title': ... 1985.0246.0077 [{'type': 'Material', 'title': 'Bark'}, {'type... {'type': 'Measurement', 'length': 574, 'width'... ... NaN Measurement 574.0 185.0 NaN mm NaN NaN NaN NaN
3 6 144359 object 'The Dance of Life (1898-1902)' by Diana Boyer... {'modified': '2018-06-18', 'issued': '2012-06-... [Booklets] {'id': '3893', 'type': 'Collection', 'title': ... 2008.0043.0022.001 [{'type': 'Material', 'title': 'Paper'}, {'typ... {'type': 'Measurement', 'height': 214, 'width'... ... NaN Measurement NaN 150.0 5.0 mm 214.0 NaN NaN NaN
4 8 42084 object Child's drawing by Lester Moran, Cabbage Tree ... {'modified': '2019-10-14', 'issued': '2016-10-... [Drawings] {'id': '2261', 'type': 'Collection', 'title': ... 1991.0024.0027 [{'type': 'Material', 'title': 'Paint - non sp... {'type': 'Measurement', 'length': 560, 'width'... ... NaN Measurement 560.0 380.0 0.5 mm NaN NaN NaN NaN

5 rows × 35 columns
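
The reset_index() step matters because json_normalize() returns a fresh dataframe numbered from 0, while the filtered rows keep their original labels. An alternative, sketched here on made-up records, is to copy the original labels onto the normalised frame before joining. This keeps the original index rather than adding an extra index column.

```python
import pandas as pd

# Made-up records: the first has no extent
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'extent': [
        None,
        {'type': 'Measurement', 'length': 260, 'unitText': 'mm'},
        {'type': 'Measurement', 'height': 214, 'unitText': 'mm'},
    ],
})

has_extent = toy.loc[toy['extent'].notnull()]

# Flatten the dictionaries into columns (rows are numbered 0, 1, ...)
flat = pd.json_normalize(has_extent['extent'].tolist()).add_prefix('extent_')

# Reuse the original row labels so that join() lines the rows up correctly
flat.index = has_extent.index
toy_extent = has_extent.join(flat)
print(toy_extent[['id', 'extent_length', 'extent_height']])
```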

Let's check to see what types of things are in the extent field.

In [25]:
df_extent['extent_type'].value_counts()
Out[25]:
Measurement    64199
Name: extent_type, dtype: int64

So they're all measurements. Let's have a look at the units being used.

In [26]:
df_extent['extent_unitText'].value_counts()
Out[26]:
mm    63504
MM       10
cm        9
m         5
Name: extent_unitText, dtype: int64
In [27]:
df_extent['extent_unitTextWeight'].value_counts()
Out[27]:
g        1713
kg        212
lb          5
oz          4
tonne       1
Name: extent_unitTextWeight, dtype: int64

Hmmm, are those measurements really in metres, or might they be meant to be 'mm'? Let's have a look at them.

In [28]:
df_extent.loc[df_extent['extent_unitText'] == 'm'][['id', 'title', 'extent_length', 'extent_width', 'extent_unitText']]
Out[28]:
id title extent_length extent_width extent_unitText
8968 202783 The Percival Project, Gull Twelve, in a manill... NaN 230.0 m
13210 257184 Fishing line inside envelope 137.0000 110.0 m
23356 171768 Fair Breeze NaN 138.0 m
31845 123962 Gunter's chain 20.1168 NaN m
63827 214193 Extension tube 55.0000 NaN m

Other than 'Gunter's chain' it looks like the unit should indeed be 'mm'. We'll need to take that into account in calculations.

Now let's convert all the measurements into a single unit – millimetre for lengths, and gram for weights.

In [29]:
def conversion_factor(unit):
    '''
    Get the factor required to convert the current unit to either mm or g.
    '''
    factors = {
        'mm': 1,
        'cm': 10,
        'm': 1, # Most 'm' values should in fact be mm (see above)
        'g': 1,
        'kg': 1000,
        'tonne': 1000000,
        'oz': 28.35,
        'lb': 453.592
    }
    try:
        # Lowercase the unit so that 'MM' is treated as 'mm'
        factor = factors[unit.lower()]
    except KeyError:
        # Unknown or missing unit
        factor = 0
    return factor

def normalise_measurements(row):
    '''
    Convert measurements to standard units.
    '''
    l_factor = conversion_factor(str(row['extent_unitText']))
    length = row['extent_length'] * l_factor
    width = row['extent_width'] * l_factor
    depth = row['extent_depth'] * l_factor
    height = row['extent_height'] * l_factor
    diameter = row['extent_diameter'] * l_factor
    w_factor = conversion_factor(str(row['extent_unitTextWeight']))
    weight = row['extent_weight'] * w_factor
    return pd.Series([length, width, depth, height, diameter, weight])

# Add normalised measurements to the dataframe
df_extent[['length_mm', 'width_mm', 'depth_mm', 'height_mm', 'diameter_mm', 'weight_g']] = df_extent.apply(normalise_measurements, axis=1)
In [30]:
df_extent.head()
Out[30]:
index id type title _meta additionalType collection identifier medium extent ... extent_height extent_diameter extent_weight extent_unitTextWeight length_mm width_mm depth_mm height_mm diameter_mm weight_g
0 1 251390 object Pair of woven shoes made from feathers and hair {'modified': '2019-01-17', 'issued': '2018-04-... [Shoes] {'id': '5244', 'type': 'Collection', 'title': ... 2000.0014.0495 [{'type': 'Material', 'title': 'Feather'}, {'t... {'type': 'Measurement', 'length': 260, 'width'... ... NaN NaN NaN NaN 260.0 120.0 40.0 NaN NaN NaN
1 2 124081 object Pair of ceremonial shoes {'modified': '2018-12-04', 'issued': '2006-10-... NaN {'id': '1892', 'type': 'Collection', 'title': ... 1992.0089.0165 [{'type': 'Material', 'title': 'Feather'}] {'type': 'Measurement', 'length': 246, 'width'... ... NaN NaN NaN NaN 246.0 190.0 45.0 NaN NaN NaN
2 5 20174 object Ten Days To Live - A supposed sorcery painting. {'modified': '2019-04-21', 'issued': '2013-06-... [Bark paintings] {'id': '2202', 'type': 'Collection', 'title': ... 1985.0246.0077 [{'type': 'Material', 'title': 'Bark'}, {'type... {'type': 'Measurement', 'length': 574, 'width'... ... NaN NaN NaN NaN 574.0 185.0 NaN NaN NaN NaN
3 6 144359 object 'The Dance of Life (1898-1902)' by Diana Boyer... {'modified': '2018-06-18', 'issued': '2012-06-... [Booklets] {'id': '3893', 'type': 'Collection', 'title': ... 2008.0043.0022.001 [{'type': 'Material', 'title': 'Paper'}, {'typ... {'type': 'Measurement', 'height': 214, 'width'... ... 214.0 NaN NaN NaN NaN 150.0 5.0 214.0 NaN NaN
4 8 42084 object Child's drawing by Lester Moran, Cabbage Tree ... {'modified': '2019-10-14', 'issued': '2016-10-... [Drawings] {'id': '2261', 'type': 'Collection', 'title': ... 1991.0024.0027 [{'type': 'Material', 'title': 'Paint - non sp... {'type': 'Measurement', 'length': 560, 'width'... ... NaN NaN NaN NaN 560.0 380.0 0.5 NaN NaN NaN

5 rows × 41 columns
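
Treating every 'm' as 'mm' means that Gunter's chain ends up about 20 mm long. One possible refinement, sketched below on made-up rows, is to convert metres properly and only override the records you've checked by hand. The SUSPECT_METRE_IDS set is hypothetical, and I'm assuming here that ids are stored as strings. Unknown units become NaN in this version rather than 0.

```python
import pandas as pd

# Real conversion factors to millimetres
LENGTH_FACTORS = {'mm': 1, 'cm': 10, 'm': 1000}

# Hypothetical set of ids where 'm' has been checked and judged a typo for 'mm'
SUSPECT_METRE_IDS = {'214193'}

# Made-up rows based on the kinds of values seen above
toy = pd.DataFrame({
    'id': ['123962', '214193', '251390', '999'],
    'extent_length': [20.1168, 55.0, 260.0, 12.0],
    'extent_unitText': ['m', 'm', 'MM', 'cm'],
})

# Normalise case, then swap 'm' for 'mm' on the hand-checked records only
units = toy['extent_unitText'].str.lower()
is_typo = toy['id'].isin(SUSPECT_METRE_IDS) & (units == 'm')
units = units.mask(is_typo, 'mm')

# Units missing from the lookup map to NaN, so the result is NaN as well
toy['length_mm'] = toy['extent_length'] * units.map(LENGTH_FACTORS)
print(toy)
```

Working on whole columns like this also avoids calling a function once per row, which might be noticeably quicker on the full dataset.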

How big is the collection?

In [31]:
def calculate_volume(row):
    '''
    Look for 3 linear dimensions and multiply them to get a volume.
    '''
    # Create a list of valid linear measurements from the available fields
    dimensions = [d for d in [row['length_mm'], row['width_mm'], row['depth_mm'], row['height_mm'], row['diameter_mm']] if not math.isnan(d)]
    
    # If there's only 2 dimensions...
    if len(dimensions) == 2:
        # Set a default height of 1 for items with only 2 dimensions
        dimensions.append(1)
        
    # If there's 3 or more dimensions, multiply the first 3 together
    if len(dimensions) >= 3:
        volume = dimensions[0] * dimensions[1] * dimensions[2]
    else:
        volume = 0
    return volume

df_extent['volume'] = df_extent.apply(calculate_volume, axis=1)
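
To make the volume rule concrete, here's a standalone sketch of the same logic applied to a few lists of measurements. The function name first_three_product is my own, and the numbers come from the sample rows shown earlier. It's a rough bounding-box estimate at best, since a diameter is treated as just another linear dimension.

```python
import math

def first_three_product(dimensions):
    '''Multiply the first three valid dimensions, padding flat items with 1.'''
    valid = [d for d in dimensions if d is not None and not math.isnan(d)]
    if len(valid) == 2:
        # Flat items get a nominal third dimension of 1 mm
        valid.append(1)
    if len(valid) >= 3:
        return valid[0] * valid[1] * valid[2]
    # Fewer than two dimensions: no volume can be estimated
    return 0

nan = float('nan')

# Woven shoes: 260 x 120 x 40 mm
shoes = first_three_product([260, 120, 40, nan, nan])

# Bark painting with only length and width: 574 x 185 mm
painting = first_three_product([574, 185, nan, nan, nan])

# A single measurement on its own
single = first_three_product([20.1168, nan, nan, nan, nan])

print(shoes, painting, single)
```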
In [32]:
print('Total length of objects is {:.2f} km'.format(df_extent['length_mm'].sum() / 1000 / 1000))
Total length of objects is 15.38 km
In [33]:
print('Total weight of objects is {:.2f} tonnes'.format(df_extent['weight_g'].sum() / 1000000))
Total weight of objects is 197.16 tonnes
In [34]:
print('Total volume of objects is {:.2f} m\N{SUPERSCRIPT THREE}'.format(df_extent['volume'].sum() / 1000000000))
Total volume of objects is 2911.19 m³

The biggest object?

What's the biggest thing?

In [35]:
# Get the object with the largest volume
biggest = df_extent.loc[df_extent['volume'].idxmax()]

# Create a link to Collection Explorer
display(HTML('<a href="http://collectionsearch.nma.gov.au/?object={}">{}</a>'.format(biggest['id'], biggest['title'])))

Created by Tim Sherratt for the GLAM Workbench.

Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab.