The complete harvest of records in the Trove People & Organisations zone is very large – more than 1.3 million records, almost 9 GB of data. To do some analysis of its content, we'll extract some aggregate totals by looping through all the EAC-CPF records. This is quite slow, but memory efficient.
If you haven't created your own harvest, you'll need to download mine from CloudStor and unzip it in the current directory.
We'll extract the following information from the harvest:
- recordids – just a list of record identifiers; we should already have these, but by extracting them we can check that the harvest contains what we were expecting!
- entity types – the total number of records for each entity type (eg. 'Person')
- sources – the total number of records for each data source
- source groups – records often aggregate information from multiple data sources; to explore the overlaps between sources we'll save the total number of records for each unique combination of data sources
- occupations – not all sources provide information on occupations, but it's an interesting way of exploring the records
- agencies – to help interpret the source data we'll harvest a complete list of data source names and identifiers

import json
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
One way of checking the number of records is simply to count the number of lines in peau-data-20230123.xml, as there's one record per line. It's relatively quick.
%%time
# This is actually pretty quick
with open("peau-data-20230123.xml") as f:
    print(sum(1 for line in f))
1309339

CPU times: user 3.25 s, sys: 1.36 s, total: 4.61 s
Wall time: 4.61 s
That seems about right – a blank search in the Trove web interface returns about the same number (remember records have probably been added since I harvested the data).
Now we're going to work through the complete dataset one record at a time, extracting some summary information. This is quite slow – the complete run below took about 55 minutes.
def increment_value(values, value):
    """
    Add the value to the supplied dict, incrementing its count if it's already there.
    """
    if value in values:
        values[value] += 1
    else:
        values[value] = 1
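The increment_value helper keeps the code dependency-free, but the same tallying can be done with the standard library's collections.Counter, which defaults missing keys to zero. A minimal sketch:

```python
from collections import Counter

# A Counter behaves like a dict whose missing keys default to 0,
# so each update is a one-liner instead of an if/else.
entity_counts = Counter()
for entity_type in ["person", "corporateBody", "person"]:
    entity_counts[entity_type] += 1

print(entity_counts["person"])         # 2
print(entity_counts["corporateBody"])  # 1
```

Counter also provides most_common(), which is handy for the "top 25" style summaries below.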
def process_xml(xml):
    """
    Process a single EAC-CPF record, extracting some basic information.
    """
    soup = BeautifulSoup(xml, "xml")
    # Converting the BS4 navigable strings to plain strings saves a lot of memory
    recordids.append(str(soup.find("recordId").string))
    # Save entity type -- one of these per record
    entity_type = str(soup.find("entityType").string)
    increment_value(entity_types, entity_type)
    # Save occupations
    local_occs = []
    for occ in soup.find_all("occupation"):
        local_occs.append(str(occ.string))
    for occ in list(set(local_occs)):
        increment_value(occupations, occ)
    # Save sources
    local_sources = []
    for source in soup.find_all("agencyCode"):
        agency_id = str(source.string)
        # Combine LA ids
        if agency_id == "AU-AuCNLKIN":
            agency_id = "AuCNLKIN"
        local_sources.append(agency_id)
    for source in list(set(local_sources)):
        increment_value(sources, source)
    # Remove system source
    local_sources.remove("AU-ANL:PEAU")
    # Save source combination by joining agency ids in a pipe-separated string
    source_group = "|".join(sorted(list(set(local_sources))))
    increment_value(source_groups, source_group)
    # Save agency details
    for agency in soup.find_all("maintenanceAgency"):
        agency_id = str(agency.find("agencyCode").string)
        agency_name = str(agency.find("agencyName").string)
        if agency_id not in agencies:
            agencies[agency_id] = agency_name
    soup.decompose()
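BeautifulSoup builds a fresh parse tree for every record, which is part of why the loop below takes so long. For comparison, here's a sketch of the same kind of extraction using the standard library's xml.etree.ElementTree, run against a hypothetical, heavily simplified EAC-CPF fragment – real records carry namespaces and much more structure, so treat this as illustrative only:

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified EAC-CPF record for illustration only --
# real records include namespaces and many more elements.
xml = """<eac-cpf>
  <control><recordId>nla.party-000001</recordId></control>
  <cpfDescription>
    <identity><entityType>person</entityType></identity>
    <description>
      <occupation><term>Actor</term></occupation>
      <occupation><term>Singer</term></occupation>
    </description>
  </cpfDescription>
</eac-cpf>"""

root = ET.fromstring(xml)
# ".//" searches at any depth, like BeautifulSoup's find()
record_id = root.find(".//recordId").text
entity_type = root.find(".//entityType").text
# iter() walks the whole tree, matching occupation elements wherever they occur
occupations = [occ.findtext("term") for occ in root.iter("occupation")]

print(record_id, entity_type, occupations)
# nla.party-000001 person ['Actor', 'Singer']
```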
%%time
entity_types = {}
occupations = {}
sources = {}
source_groups = {}
recordids = []
agencies = {}
with Path("peau-data-20230123.xml").open("r") as xml_file:
    for i, xml in enumerate(xml_file):
        # if i < 1000:
        process_xml(xml)
CPU times: user 55min 19s, sys: 3.26 s, total: 55min 22s Wall time: 55min 22s
Once we've extracted the data we can check that the number of record ids extracted corresponds to the number of lines in the dataset.
# How many recordids? Should be the same as above.
len(recordids)
1309339
It's possible that some duplicate records might have snuck into the dataset. Let's check by looking at the number of unique record ids.
# How many unique recordids? Should be the same as above.
len(set(recordids))
1309339
What types of records are there?
df_types = pd.DataFrame(
[{"entity_type": k, "total": v} for k, v in entity_types.items()]
)
df_types.style.format(thousands=",").hide()
entity_type | total |
---|---|
person | 1,085,416 |
corporateBody | 223,193 |
family | 730 |
Families? There's no mention of families in the Trove web interface. This would be interesting to explore further.
Where has the data come from? See below for a list of sources with the full agency names added.
df_sources = pd.DataFrame([{"agency_id": k, "total": v} for k, v in sources.items()])
df_sources
agency_id | total | |
---|---|---|
0 | AuCNLKIN | 998929 |
1 | AU-ANL:PEAU | 1309339 |
2 | AU-SAUS | 173307 |
3 | AU-ANU:ADBO | 13433 |
4 | AU-AIAS | 48360 |
... | ... | ... |
65 | OCLC-SUDOC | 1 |
66 | OCLC-EGAXA | 2 |
67 | TO-DO | 2 |
68 | OCLC-VIAF:TEST | 3 |
69 | OCLC-NLIara | 1 |
70 rows × 2 columns
# Add a field with the number of sources in the group
df_source_groups = pd.DataFrame(
[
{"source_group": k, "number_of_sources": len(k.split("|")), "total": v}
for k, v in source_groups.items()
]
)
df_source_groups.sort_values("total", ascending=False).head().style.hide().format(
thousands=","
)
source_group | number_of_sources | total |
---|---|---|
AuCNLKIN | 1 | 941,859 |
AU-SAUS | 1 | 167,827 |
AU-QPRO | 1 | 57,143 |
AU-AIAS|AuCNLKIN | 2 | 36,574 |
AU-YORCID | 1 | 23,145 |
The data includes groups where there's only one source. Let's exclude them and look at the top 25 source combinations. See the intersections notebook for more examination of the overlaps between data sources.
df_source_groups.loc[
df_source_groups["source_group"].str.contains("|", regex=False)
].sort_values("total", ascending=False)[:25].style.hide().format(thousands=",")
source_group | number_of_sources | total |
---|---|---|
AU-AIAS|AuCNLKIN | 2 | 36,574 |
AU-SAUS|AuCNLKIN | 2 | 4,018 |
AU-NUN:DAAO|AuCNLKIN | 2 | 2,635 |
AU-ANU:ADBO|AuCNLKIN | 2 | 2,144 |
AU-VU:EOAS|AuCNLKIN | 2 | 1,977 |
AU-YORCID|AuCNLKIN | 2 | 1,299 |
AU-VU:AWR|AuCNLKIN | 2 | 852 |
AU-ANU:ADBO|AU-VU:EOAS | 2 | 727 |
AU-ANU:ADBO|AU-VU:EOAS|AuCNLKIN | 3 | 649 |
AU-ANU:ADBO|AU-ANU:OA | 2 | 522 |
AU-VPRO|AuCNLKIN | 2 | 483 |
AU-VU|AuCNLKIN | 2 | 443 |
AU-ANU:OA|AuCNLKIN | 2 | 440 |
AU-AIAS|AU-SAUS|AuCNLKIN | 3 | 377 |
AU-APAR|AuCNLKIN | 2 | 375 |
AU-ANU:ADBO|AU-NUN:DAAO|AuCNLKIN | 3 | 317 |
AU-NAMC|AuCNLKIN | 2 | 276 |
AU-ANU:ADBO|AU-VU:AWR | 2 | 272 |
AU-ANU:ADBO|AU-ANU:OA|AuCNLKIN | 3 | 256 |
AU-NMUS:CAN|AuCNLKIN | 2 | 249 |
AU-AIAS|AU-NUN:DAAO|AuCNLKIN | 3 | 181 |
AU-ANU:ADBO|AU-NUN:DAAO | 2 | 175 |
AU-ANU:ADBO|AU-SAUS|AuCNLKIN | 3 | 164 |
AU-AIAS|AU-ANU:ADBO|AuCNLKIN | 3 | 151 |
AU-ANL:MA-DM|AU-NAMO | 2 | 108 |
For more exploration of sources and source groups, see the intersections notebook.
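One way to start quantifying those overlaps is to expand each source group into pairs of sources and sum the record totals for every pair. A minimal sketch, using a few hypothetical sample values in the same shape as the source_groups dictionary built above:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample in the same shape as the source_groups dict above:
# pipe-separated source ids mapped to record counts.
source_groups = {
    "AuCNLKIN": 941859,
    "AU-AIAS|AuCNLKIN": 36574,
    "AU-ANU:ADBO|AU-VU:EOAS|AuCNLKIN": 649,
}

overlaps = Counter()
for group, total in source_groups.items():
    # Every pair of sources within a group shares these records;
    # single-source groups produce no pairs and are skipped naturally.
    for pair in combinations(sorted(group.split("|")), 2):
        overlaps[pair] += total

print(overlaps[("AU-AIAS", "AuCNLKIN")])  # 36574
```

Sorting each group before taking combinations keeps the pair keys consistent regardless of the order sources appear in.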
Let's look at the top 25 occupations (remembering that not all data sources provide information about occupations). The prevalence of performing artists suggests that a lot of this data is coming from AusStage.
df_occupations = pd.DataFrame(
[{"occupation": k, "total": v} for k, v in occupations.items()]
)
df_occupations.sort_values("total", ascending=False)[:25].style.hide().format(
thousands=","
)
occupation | total |
---|---|
Actor | 74,781 |
None | 17,249 |
Performer | 15,202 |
Dancer | 11,411 |
Director | 9,955 |
Singer | 8,116 |
Actor and Singer | 6,496 |
Playwright | 6,226 |
Musician | 5,686 |
Composer | 5,639 |
Painter | 5,229 |
Writer | 5,164 |
Stage Manager | 4,023 |
Choreographer | 3,621 |
Producer | 3,589 |
Designer | 3,257 |
Author | 3,195 |
Politician | 3,055 |
Lighting Designer | 2,877 |
Costume Designer | 2,759 |
Chorus | 2,633 |
Set Designer | 2,583 |
Authors | 2,334 |
Photographer | 2,273 |
Musical Director | 2,178 |
This provides a list of the agencies, or data sources, contributing to the People and Organisations zone.
df_agencies = pd.DataFrame(
[{"agency_id": k, "agency_name": v} for k, v in agencies.items()]
)
df_agencies
agency_id | agency_name | |
---|---|---|
0 | AU-ANL:PEAU | National Library of Australia Party Infrastruc... |
1 | AuCNLKIN | Libraries Australia |
2 | AU-SAUS | AusStage |
3 | AU-ANU:ADBO | Australian Dictionary of Biography |
4 | AU-AIAS | AIATSIS Aboriginal Biographical Index |
... | ... | ... |
66 | OCLC-JPG | JPG |
67 | OCLC-RERO | RERO |
68 | TO-DO | The University of Examples, Australia |
69 | OCLC-VIAF:TEST | VIAF: The Virtual International Authority File |
70 | OCLC-NLIara | NLIara |
71 rows × 2 columns
By combining the agencies data with the sources data, we can add the names of the agencies supplying the data to the sources list.
df_sources = pd.merge(df_sources, df_agencies, how="inner", on="agency_id")
df_sources = df_sources[["agency_id", "agency_name", "total"]]
df_sources.sort_values("total", ascending=False).style.hide().format(thousands=",")
agency_id | agency_name | total |
---|---|---|
AU-ANL:PEAU | National Library of Australia Party Infrastructure | 1,309,339 |
AuCNLKIN | Libraries Australia | 998,929 |
AU-SAUS | AusStage | 173,307 |
AU-QPRO | The Prosecution Project | 57,214 |
AU-AIAS | AIATSIS Aboriginal Biographical Index | 48,360 |
AU-YORCID | ORCID | 24,998 |
AU-NUN:DAAO | Design & Art Australia Online | 17,003 |
AU-ANU:ADBO | Australian Dictionary of Biography | 13,433 |
AU-VU:EOAS | Encyclopedia of Australian Science | 8,259 |
AU-ANU:OA | Obituaries Australia | 8,115 |
AU-VU:AWR | The Australian Women's Register | 6,699 |
AU-VU | University of Melbourne | 2,966 |
AU-VPRO | Public Records Office Victoria | 2,727 |
AU-NAMO | Australian Music Online | 2,170 |
AU-APAR | Australian Parliamentary Library | 1,825 |
AU-NMUS:CAN | Collections Australia Network | 1,692 |
AU-QGU | Griffith University | 1,601 |
AU-NSAL | Sydney's Aldermen | 1,048 |
AU-APC:WB | Australian Paralympic Committee | 790 |
AU-NAMC | Australian Music Centre | 601 |
AU-QJCU | James Cook University, Australia | 457 |
AU-ANL:MA-DM | destra Media | 410 |
AU-APAR:S | Department of the Senate | 328 |
AU-ANL:AD | Australia Dancing | 325 |
AU-AMG | GeoScience Australia | 260 |
AU-ANL:MA | Music Australia | 255 |
AU-QUT | Queensland University of Technology | 202 |
AU-VANDS | Australian Research Institutions | 199 |
AU-VDU | Deakin University, Australia | 193 |
AU-NUWS | University of Western Sydney | 172 |
AU-NTSM | University of Technology, Sydney | 128 |
AU-WS:AUS | AuScope | 126 |
AU-QU | The University of Queensland | 117 |
AU-SFU:PDM | Flinders University | 116 |
AU-NWU | University of Wollongong | 114 |
AU-SUSA | University of South Australia | 94 |
AU-VASD | Australian Sound Design | 94 |
AU-VSWT | Swinburne University of Technology | 90 |
AU-TU | University of Tasmania, Australia | 76 |
AU-QUT:GP | Queensland University of Technology | 55 |
AU-VLU | La Trobe University | 49 |
AU-NU | The University of Sydney, Australia | 40 |
AU-WU | The University of Western Australia | 39 |
AU-NNCU | University of Newcastle | 37 |
AU-SUA | The University of Adelaide | 35 |
OCLC-VIAF | AIATSIS Aboriginal Biographical Index | 21 |
AU-NMQU | Macquarie University, Australia | 18 |
OCLC-NLA | NLA | 14 |
OCLC-LC | LC | 13 |
AU-ANU | Australian National University | 9 |
OCLC-BNF | BNF | 8 |
AU-VAFI | RMIT University | 8 |
OCLC-DNB | DNB | 8 |
AU-ANL:PO | National Library of Australia People and Organisations | 5 |
OCLC-BNE | BNE | 5 |
OCLC-NKC | NKC | 4 |
OCLC-JPG | JPG | 3 |
OCLC-NUKAT | NUKAT | 3 |
OCLC-VIAF:TEST | VIAF: The Virtual International Authority File | 3 |
OCLC-RERO | RERO | 2 |
TO-DO | The University of Examples, Australia | 2 |
OCLC-EGAXA | EGAXA | 2 |
OCLC-NLIlat | NLIlat | 2 |
AU-NUNE | The University of New England | 2 |
OCLC-LAC | LAC | 2 |
OCLC-PTBNP | PTBNP | 2 |
OCLC-BAV | BAV | 1 |
OCLC-SUDOC | SUDOC | 1 |
OCLC-SELIBR | SELIBR | 1 |
OCLC-NLIara | NLIara | 1 |
df_sources.to_csv("peau_sources.csv", index=False)
df_source_groups.to_csv("peau_source_groups.csv", index=False)
df_occupations.to_csv("peau_occupations.csv", index=False)
df_types.to_csv("peau_types.csv", index=False)
with Path("peau_ids.txt").open("w") as txt_file:
    for rid in recordids:
        txt_file.write(f"{rid}\n")
with open("peau_agencies.json", "w") as json_file:
    json.dump(agencies, json_file)
Created by Tim Sherratt for the GLAM Workbench.
The development of this notebook was supported by the Australian Cultural Data Engine.