The complete harvest of records in the Trove People & Organisations zone is very large – more than 1.3 million records, almost 9 GB of data. To do some analysis of its content, we'll extract some aggregate totals by looping through all the EAC-CPF records. This is quite slow, but memory efficient.
If you haven't created your own harvest, you'll need to download mine from CloudStor and unzip it in the current directory.
We'll extract the following information from the harvest:
- recordids – just a list of record identifiers; we should already have these, but by extracting them we can check that the harvest contains what we were expecting!
- entity types – the total number of records for each entity type (eg. 'Person')
- sources – the total number of records for each data source
- source groups – records often aggregate information from multiple data sources; to explore the overlaps between sources we'll save the total number of records for each unique combination of data sources
- occupations – not all sources provide information on occupations, but it's an interesting way of exploring the records
- agencies – to help interpret the source data we'll harvest a complete list of data source names and identifiers

import json
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
One way of checking the number of records is simply to count the number of lines in peau-data-20230123.xml, as there's one record per line. It's relatively quick.
%%time
# This is actually pretty quick
with open("peau-data-20230123.xml") as f:
    print(sum(1 for line in f))
1309339

CPU times: user 3.25 s, sys: 1.36 s, total: 4.61 s
Wall time: 4.61 s
That seems about right – a blank search in the Trove web interface returns about the same number (remember records have probably been added since I harvested the data).
Now we're going to work through the complete dataset one record at a time, extracting some summary information. This is quite slow – the complete run below took about 55 minutes.
def increment_value(values, value):
    """
    Add the value to the supplied dict, incrementing its count if it's already there.
    """
    if value in values:
        values[value] += 1
    else:
        values[value] = 1
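The increment_value helper keeps the code dependency-free, but the same tallying can be done with the standard library's collections.Counter, which defaults missing keys to zero. A minimal sketch:

```python
from collections import Counter

# A Counter behaves like a dict whose missing keys default to 0,
# so each update is a one-liner instead of an if/else.
entity_counts = Counter()
for entity_type in ["person", "corporateBody", "person"]:
    entity_counts[entity_type] += 1

print(entity_counts["person"])         # 2
print(entity_counts["corporateBody"])  # 1
```

Counter also provides most_common(), which is handy for the "top 25" style summaries below.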
def process_xml(xml):
    """
    Process a single EAC-CPF record, extracting some basic information.
    """
    soup = BeautifulSoup(xml, "xml")
    # Converting the BS4 navigable strings to plain strings saves a lot of memory
    recordids.append(str(soup.find("recordId").string))
    # Save entity type -- one of these per record
    entity_type = str(soup.find("entityType").string)
    increment_value(entity_types, entity_type)
    # Save occupations
    local_occs = []
    for occ in soup.find_all("occupation"):
        local_occs.append(str(occ.string))
    for occ in list(set(local_occs)):
        increment_value(occupations, occ)
    # Save sources
    local_sources = []
    for source in soup.find_all("agencyCode"):
        agency_id = str(source.string)
        # Combine LA ids
        if agency_id == "AU-AuCNLKIN":
            agency_id = "AuCNLKIN"
        local_sources.append(agency_id)
    for source in list(set(local_sources)):
        increment_value(sources, source)
    # Remove system source
    local_sources.remove("AU-ANL:PEAU")
    # Save source combination by joining agency ids in a pipe-separated string
    source_group = "|".join(sorted(list(set(local_sources))))
    increment_value(source_groups, source_group)
    # Save agency details
    for agency in soup.find_all("maintenanceAgency"):
        agency_id = str(agency.find("agencyCode").string)
        agency_name = str(agency.find("agencyName").string)
        if agency_id not in agencies:
            agencies[agency_id] = agency_name
    soup.decompose()
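BeautifulSoup builds a fresh parse tree for every record, which is part of why the loop below takes so long. For comparison, here's a sketch of the same kind of extraction using the standard library's xml.etree.ElementTree, run against a hypothetical, heavily simplified EAC-CPF fragment – real records carry namespaces and much more structure, so treat this as illustrative only:

```python
import xml.etree.ElementTree as ET

# A hypothetical, simplified EAC-CPF record for illustration only --
# real records include namespaces and many more elements.
xml = """<eac-cpf>
  <control><recordId>nla.party-000001</recordId></control>
  <cpfDescription>
    <identity><entityType>person</entityType></identity>
    <description>
      <occupation><term>Actor</term></occupation>
      <occupation><term>Singer</term></occupation>
    </description>
  </cpfDescription>
</eac-cpf>"""

root = ET.fromstring(xml)
# ".//" searches at any depth, like BeautifulSoup's find()
record_id = root.find(".//recordId").text
entity_type = root.find(".//entityType").text
# iter() walks the whole tree, matching occupation elements wherever they occur
occupations = [occ.findtext("term") for occ in root.iter("occupation")]

print(record_id, entity_type, occupations)
# nla.party-000001 person ['Actor', 'Singer']
```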
%%time
entity_types = {}
occupations = {}
sources = {}
source_groups = {}
recordids = []
agencies = {}
with Path("peau-data-20230123.xml").open("r") as xml_file:
    for i, xml in enumerate(xml_file):
        # if i < 1000:
        process_xml(xml)
CPU times: user 55min 19s, sys: 3.26 s, total: 55min 22s Wall time: 55min 22s
Once we've extracted the data we can check that the number of record ids extracted corresponds to the number of lines in the dataset.
# How many recordids? Should be the same as above.
len(recordids)
1309339
It's possible that some duplicate records might have snuck into the dataset. Let's check by looking at the number of unique record ids.
# How many unique recordids? Should be the same as above.
len(set(recordids))
1309339
What types of records are there?
df_types = pd.DataFrame(
[{"entity_type": k, "total": v} for k, v in entity_types.items()]
)
df_types.style.format(thousands=",").hide()
entity_type | total |
---|---|
person | 1,085,416 |
corporateBody | 223,193 |
family | 730 |
Families? There's no mention of families in the Trove web interface. This would be interesting to explore further.
Where has the data come from? See below for a list of sources with the full agency names added.
df_sources = pd.DataFrame([{"agency_id": k, "total": v} for k, v in sources.items()])
df_sources
agency_id | total | |
---|---|---|
0 | AuCNLKIN | 998929 |
1 | AU-ANL:PEAU | 1309339 |
2 | AU-SAUS | 173307 |
3 | AU-ANU:ADBO | 13433 |
4 | AU-AIAS | 48360 |
... | ... | ... |
65 | OCLC-SUDOC | 1 |
66 | OCLC-EGAXA | 2 |
67 | TO-DO | 2 |
68 | OCLC-VIAF:TEST | 3 |
69 | OCLC-NLIara | 1 |
70 rows × 2 columns
# Add a field with the number of sources in the group
df_source_groups = pd.DataFrame(
[
{"source_group": k, "number_of_sources": len(k.split("|")), "total": v}
for k, v in source_groups.items()
]
)
df_source_groups.sort_values("total", ascending=False).head().style.hide().format(
thousands=","
)
source_group | number_of_sources | total |
---|---|---|
AuCNLKIN | 1 | 941,859 |
AU-SAUS | 1 | 167,827 |
AU-QPRO | 1 | 57,143 |
AU-AIAS|AuCNLKIN | 2 | 36,574 |
AU-YORCID | 1 | 23,145 |
The data includes groups where there's only one source. Let's exclude them and look at the top 25 source combinations. See the intersections notebook for more examination of the overlaps between data sources.
df_source_groups.loc[
df_source_groups["source_group"].str.contains("|", regex=False)
].sort_values("total", ascending=False)[:25].style.hide().format(thousands=",")
source_group | number_of_sources | total |
---|---|---|
AU-AIAS|AuCNLKIN | 2 | 36,574 |
AU-SAUS|AuCNLKIN | 2 | 4,018 |
AU-NUN:DAAO|AuCNLKIN | 2 | 2,635 |
AU-ANU:ADBO|AuCNLKIN | 2 | 2,144 |
AU-VU:EOAS|AuCNLKIN | 2 | 1,977 |
AU-YORCID|AuCNLKIN | 2 | 1,299 |
AU-VU:AWR|AuCNLKIN | 2 | 852 |
AU-ANU:ADBO|AU-VU:EOAS | 2 | 727 |
AU-ANU:ADBO|AU-VU:EOAS|AuCNLKIN | 3 | 649 |
AU-ANU:ADBO|AU-ANU:OA | 2 | 522 |
AU-VPRO|AuCNLKIN | 2 | 483 |
AU-VU|AuCNLKIN | 2 | 443 |
AU-ANU:OA|AuCNLKIN | 2 | 440 |
AU-AIAS|AU-SAUS|AuCNLKIN | 3 | 377 |
AU-APAR|AuCNLKIN | 2 | 375 |
AU-ANU:ADBO|AU-NUN:DAAO|AuCNLKIN | 3 | 317 |
AU-NAMC|AuCNLKIN | 2 | 276 |
AU-ANU:ADBO|AU-VU:AWR | 2 | 272 |
AU-ANU:ADBO|AU-ANU:OA|AuCNLKIN | 3 | 256 |
AU-NMUS:CAN|AuCNLKIN | 2 | 249 |
AU-AIAS|AU-NUN:DAAO|AuCNLKIN | 3 | 181 |
AU-ANU:ADBO|AU-NUN:DAAO | 2 | 175 |
AU-ANU:ADBO|AU-SAUS|AuCNLKIN | 3 | 164 |
AU-AIAS|AU-ANU:ADBO|AuCNLKIN | 3 | 151 |
AU-ANL:MA-DM|AU-NAMO | 2 | 108 |
For more exploration of sources and source groups, see the intersections notebook.
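One way to start quantifying those overlaps is to expand each source group into pairs of sources and sum the record totals for every pair. A minimal sketch, using a few hypothetical sample values in the same shape as the source_groups dictionary built above:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample in the same shape as the source_groups dict above:
# pipe-separated source ids mapped to record counts.
source_groups = {
    "AuCNLKIN": 941859,
    "AU-AIAS|AuCNLKIN": 36574,
    "AU-ANU:ADBO|AU-VU:EOAS|AuCNLKIN": 649,
}

overlaps = Counter()
for group, total in source_groups.items():
    # Every pair of sources within a group shares these records;
    # single-source groups produce no pairs and are skipped naturally.
    for pair in combinations(sorted(group.split("|")), 2):
        overlaps[pair] += total

print(overlaps[("AU-AIAS", "AuCNLKIN")])  # 36574
```

Sorting each group before taking combinations keeps the pair keys consistent regardless of the order sources appear in.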
Let's look at the top 25 occupations (remembering that not all data sources provide information about occupations). The prevalence of performing artists suggests that a lot of this data is coming from AusStage.
df_occupations = pd.DataFrame(
[{"occupation": k, "total": v} for k, v in occupations.items()]
)
df_occupations.sort_values("total", ascending=False)[:25].style.hide().format(
thousands=","
)
occupation | total |
---|---|
Actor | 74,781 |
None | 17,249 |
Performer | 15,202 |
Dancer | 11,411 |
Director | 9,955 |
Singer | 8,116 |
Actor and Singer | 6,496 |
Playwright | 6,226 |
Musician | 5,686 |
Composer | 5,639 |
Painter | 5,229 |
Writer | 5,164 |
Stage Manager | 4,023 |
Choreographer | 3,621 |
Producer | 3,589 |
Designer | 3,257 |
Author | 3,195 |
Politician | 3,055 |
Lighting Designer | 2,877 |
Costume Designer | 2,759 |
Chorus | 2,633 |
Set Designer | 2,583 |
Authors | 2,334 |
Photographer | 2,273 |
Musical Director | 2,178 |
This provides a list of the agencies, or data sources, contributing to the People and Organisations zone.
df_agencies = pd.DataFrame(
[{"agency_id": k, "agency_name": v} for k, v in agencies.items()]
)
df_agencies
agency_id | agency_name | |
---|---|---|
0 | AU-ANL:PEAU | National Library of Australia Party Infrastruc... |
1 | AuCNLKIN | Libraries Australia |
2 | AU-SAUS | AusStage |
3 | AU-ANU:ADBO | Australian Dictionary of Biography |
4 | AU-AIAS | AIATSIS Aboriginal Biographical Index |
... | ... | ... |
66 | OCLC-JPG | JPG |
67 | OCLC-RERO | RERO |
68 | TO-DO | The University of Examples, Australia |
69 | OCLC-VIAF:TEST | VIAF: The Virtual International Authority File |
70 | OCLC-NLIara | NLIara |
71 rows × 2 columns
By combining the agencies data with the sources data, we can add the names of the agencies supplying the data to the sources list.
df_sources = pd.merge(df_sources, df_agencies, how="inner", on="agency_id")
df_sources = df_sources[["agency_id", "agency_name", "total"]]
df_sources.sort_values("total", ascending=False).style.hide().format(thousands=",")
agency_id | agency_name | total |
---|---|---|
AU-ANL:PEAU | National Library of Australia Party Infrastructure | 1,309,339 |
AuCNLKIN | Libraries Australia | 998,929 |
AU-SAUS | AusStage | 173,307 |
AU-QPRO | The Prosecution Project | 57,214 |
AU-AIAS | AIATSIS Aboriginal Biographical Index | 48,360 |
AU-YORCID | ORCID | 24,998 |
AU-NUN:DAAO | Design & Art Australia Online | 17,003 |
AU-ANU:ADBO | Australian Dictionary of Biography | 13,433 |
AU-VU:EOAS | Encyclopedia of Australian Science | 8,259 |
AU-ANU:OA | Obituaries Australia | 8,115 |
AU-VU:AWR | The Australian Women's Register | 6,699 |
AU-VU | University of Melbourne | 2,966 |
AU-VPRO | Public Records Office Victoria | 2,727 |
AU-NAMO | Australian Music Online | 2,170 |
AU-APAR | Australian Parliamentary Library | 1,825 |
AU-NMUS:CAN | Collections Australia Network | 1,692 |
AU-QGU | Griffith University | 1,601 |
AU-NSAL | Sydney's Aldermen | 1,048 |
AU-APC:WB | Australian Paralympic Committee | 790 |
AU-NAMC | Australian Music Centre | 601 |
AU-QJCU | James Cook University, Australia | 457 |
AU-ANL:MA-DM | destra Media | 410 |
AU-APAR:S | Department of the Senate | 328 |
AU-ANL:AD | Australia Dancing | 325 |
AU-AMG | GeoScience Australia | 260 |
AU-ANL:MA | Music Australia | 255 |
AU-QUT | Queensland University of Technology | 202 |
AU-VANDS | Australian Research Institutions | 199 |
AU-VDU | Deakin University, Australia | 193 |
AU-NUWS | University of Western Sydney | 172 |
AU-NTSM | University of Technology, Sydney | 128 |
AU-WS:AUS | AuScope | 126 |
AU-QU | The University of Queensland | 117 |
AU-SFU:PDM | Flinders University | 116 |
AU-NWU | University of Wollongong | 114 |
AU-SUSA | University of South Australia | 94 |
AU-VASD | Australian Sound Design | 94 |
AU-VSWT | Swinburne University of Technology | 90 |
AU-TU | University of Tasmania, Australia | 76 |
AU-QUT:GP | Queensland University of Technology | 55 |
AU-VLU | La Trobe University | 49 |
AU-NU | The University of Sydney, Australia | 40 |
AU-WU | The University of Western Australia | 39 |
AU-NNCU | University of Newcastle | 37 |
AU-SUA | The University of Adelaide | 35 |
OCLC-VIAF | AIATSIS Aboriginal Biographical Index | 21 |
AU-NMQU | Macquarie University, Australia | 18 |
OCLC-NLA | NLA | 14 |
OCLC-LC | LC | 13 |
AU-ANU | Australian National University | 9 |
OCLC-BNF | BNF | 8 |
AU-VAFI | RMIT University | 8 |
OCLC-DNB | DNB | 8 |
AU-ANL:PO | National Library of Australia People and Organisations | 5 |
OCLC-BNE | BNE | 5 |
OCLC-NKC | NKC | 4 |
OCLC-JPG | JPG | 3 |
OCLC-NUKAT | NUKAT | 3 |
OCLC-VIAF:TEST | VIAF: The Virtual International Authority File | 3 |
OCLC-RERO | RERO | 2 |
TO-DO | The University of Examples, Australia | 2 |
OCLC-EGAXA | EGAXA | 2 |
OCLC-NLIlat | NLIlat | 2 |
AU-NUNE | The University of New England | 2 |
OCLC-LAC | LAC | 2 |
OCLC-PTBNP | PTBNP | 2 |
OCLC-BAV | BAV | 1 |
OCLC-SUDOC | SUDOC | 1 |
OCLC-SELIBR | SELIBR | 1 |
OCLC-NLIara | NLIara | 1 |
df_sources.to_csv("peau_sources.csv", index=False)
df_source_groups.to_csv("peau_source_groups.csv", index=False)
df_occupations.to_csv("peau_occupations.csv", index=False)
df_types.to_csv("peau_types.csv", index=False)
with Path("peau_ids.txt").open("w") as txt_file:
    for rid in recordids:
        txt_file.write(f"{rid}\n")
with open("peau_agencies.json", "w") as json_file:
    json.dump(agencies, json_file)
Created by Tim Sherratt for the GLAM Workbench.
The development of this notebook was supported by the Australian Cultural Data Engine.