Exploring harvested series data, May 2021¶

This notebook examines data from a complete harvest of series publicly available through RecordSearch in May 2021. See this notebook for the harvesting method.

In [47]:

import pandas as pd

In [61]:

df = pd.read_csv("series_totals_May_2021.csv")

In [62]:

df.head()

Out[62]:

	identifier	title	contents_date_str	contents_start_date	contents_end_date	quantity_total	described_note	described_total	digitised_total	access_open_total	access_nye_total	access_owe_total
0	A1	Correspondence files, annual single number ser...	01 Jan 1890 - 31 Dec 1969	1890-01-01	1969-12-31	476.33	All items from this series are entered on Reco...	64444	56643	64445	10	3
1	A2	Correspondence files, annual single number series	01 Jan 1895 - 31 Dec 1926	1895-01-01	1926-12-31	35.74	All items from this series are entered on Reco...	3409	362	3403	6	0
2	A3	Correspondence files, annual single number ser...	01 Jan 1839 - 23 May 1963	1839-01-01	1963-05-23	26.48	All items from this series are entered on Reco...	1382	263	1374	0	8
3	A4	Correspondence files, single number series wit...	NaN	NaN	NaN	0.36	Click to see items listed on RecordSearch. Ple...	45	2	45	0	0
4	A5	Correspondence files, annual single number ser...	01 Jan 1923 - 31 Dec 1924	1923-01-01	1924-12-31	1.80	Click to see items listed on RecordSearch. Ple...	200	14	198	2	0

Some basic statistics¶

Note that these numbers might not be exact. To work around RecordSearch's limit of 20,000 search results, some totals were calculated by aggregating the results of several searches. In most cases this will be accurate, but items with multiple control symbols may be counted more than once. I think any errors will be small.

Number of series¶

In [71]:

print(f"{df.shape[0]:,} series")

65,719 series

Quantity of records in linear metres¶

In [74]:

print(f'{round(df["quantity_total"].sum(), 2):,} metres of records')

337,670.69 metres of records

Number of items described in RecordSearch¶

In [73]:

print(f'{df["described_total"].sum():,} items described')

13,910,490 items described

Number of items digitised¶

In [76]:

print(f'{df["digitised_total"].sum():,} items digitised')

2,234,232 items digitised

In [91]:

print(
    f'{df["digitised_total"].sum() / df["described_total"].sum():0.2%} of described items are digitised'
)

16.06% of described items are digitised

Access status of items described¶

In [108]:

access_totals = [
    {"access status": "Open", "total": df["access_open_total"].sum()},
    {"access status": "Open with exceptions", "total": df["access_owe_total"].sum()},
    {"access status": "Closed", "total": df["access_closed_total"].sum()},
    {"access status": "Not yet examined", "total": df["access_nye_total"].sum()},
]

df_access = pd.DataFrame(access_totals)
df_access["percent"] = df_access["total"] / df_access["total"].sum()

df_access.style.format({"total": "{:,.0f}", "percent": "{:0.2%}"})

Out[108]:

	access status	total	percent
0	Open	7,782,821	55.24%
1	Open with exceptions	109,135	0.77%
2	Closed	11,070	0.08%
3	Not yet examined	6,186,653	43.91%
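As a rough consistency check on the aggregation issue mentioned earlier: the four access totals above add up to 14,089,679, which is 179,189 more than the 13,910,490 items described. Here's a minimal sketch of how you might find the series responsible, by comparing the sum of the access columns with `described_total` for each series. The sample rows are illustrative only (the `access_closed_total` values are invented), and `access_discrepancy` is a hypothetical helper that isn't part of the original notebook.

```python
import pandas as pd

# Illustrative sample using the column names from the harvested CSV.
# The access_closed_total values are invented for this example.
sample = pd.DataFrame(
    {
        "identifier": ["A1", "A2", "A3"],
        "described_total": [64444, 3409, 1382],
        "access_open_total": [64445, 3403, 1374],
        "access_owe_total": [3, 0, 8],
        "access_closed_total": [0, 0, 0],
        "access_nye_total": [10, 6, 0],
    }
)

ACCESS_COLUMNS = [
    "access_open_total",
    "access_owe_total",
    "access_closed_total",
    "access_nye_total",
]


def access_discrepancy(frame):
    """Return the series whose access totals don't add up to described_total."""
    result = frame.copy()
    result["access_sum"] = result[ACCESS_COLUMNS].sum(axis=1)
    result["difference"] = result["access_sum"] - result["described_total"]
    columns = ["identifier", "described_total", "access_sum", "difference"]
    return result.loc[result["difference"] != 0, columns]


mismatches = access_discrepancy(sample)
print(mismatches)
```

To try this on the harvested data, pass `df` instead of `sample`. A positive difference would be consistent with items being counted more than once, but it doesn't prove that's the cause.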

Digging deeper¶

How many items are there in total?¶

There's no way of knowing this from the harvested data. However, the recently released Tune Review says that 37% of the NAA's holdings are described. Since we know the number of items described, we can calculate an approximate total number of items.

In [124]:

print(f'Approximately {int(df["described_total"].sum() / 0.37):,} items in total')

Approximately 37,595,918 items in total

To put that another way, this is the approximate number of items not listed on RecordSearch:

In [136]:

print(
    f'Approximately {int(df["described_total"].sum() / 0.37) - df["described_total"].sum():,} items **are not** listed on RecordSearch'
)

Approximately 23,685,428 items **are not** listed on RecordSearch

That's something to keep in mind if you're just relying on item keyword searches to find relevant content.
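The estimate depends heavily on the 37% figure, so it can be useful to see how much the result moves if that proportion is a little off. This is a minimal sketch: `estimate_totals` is a hypothetical helper, and the 35% and 39% values are arbitrary assumptions chosen only to show the sensitivity, not figures from the Tune Review.

```python
def estimate_totals(described, proportion_described):
    """Estimate total and undescribed items from the number described.

    `proportion_described` is the share of holdings that are described,
    expressed as a fraction (for example 0.37 for 37%).
    """
    if not 0 < proportion_described <= 1:
        raise ValueError("proportion_described must be between 0 and 1")
    total = int(described / proportion_described)
    return total, total - described


# Number of items described, taken from the harvested data above.
described = 13_910_490

# 0.37 comes from the Tune Review; 0.35 and 0.39 are assumed alternatives.
for proportion in (0.35, 0.37, 0.39):
    total, undescribed = estimate_totals(described, proportion)
    print(
        f"{proportion:.0%} described: ~{total:,} items in total, "
        f"~{undescribed:,} not listed"
    )
```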

How much of each series is described at item level?¶

The note that accompanies the number of items listed in RecordSearch indicates how much of the series has been described at item level. By looking at the frequency of each of the values for this note, we can get a sense of the level of description across the collection.

In [109]:

df_described = df["described_note"].value_counts().to_frame()
df_described.columns = ["total"]
df_described["percent"] = df_described["total"] / df_described["total"].sum()
df_described.style.format({"total": "{:,.0f}", "percent": "{:0.2%}"})

Out[109]:

	total	percent
No items from the series are on RecordSearch. Please contact the National Reference Service if you need assistance.	41,794	63.60%
Click to see items listed on RecordSearch. Please contact the National Reference Service if you can't find the record you want as not all items from the series may be on RecordSearch.	13,032	19.83%
All items from this series are entered on RecordSearch.	10,862	16.53%
No items from the series are on RecordSearch. Please contact the Australian War Memorial if you need assistance.	30	0.05%

The numbers above might be a bit misleading because series are sometimes registered on RecordSearch before any items are actually transferred to the NAA. So the reason there are no items listed might be that there are no items currently in Archives custody. To try and get a more accurate picture, we can filter out series where the quantity held by the NAA is equal to zero metres.

In [112]:

df_described_held = (
    df.loc[df["quantity_total"] != 0, "described_note"].value_counts().to_frame()
)
df_described_held.columns = ["total"]
df_described_held["percent"] = (
    df_described_held["total"] / df_described_held["total"].sum()
)
df_described_held.style.format({"total": "{:,.0f}", "percent": "{:0.2%}"})

Out[112]:

	total	percent
No items from the series are on RecordSearch. Please contact the National Reference Service if you need assistance.	19,410	47.85%
All items from this series are entered on RecordSearch.	10,707	26.39%
Click to see items listed on RecordSearch. Please contact the National Reference Service if you can't find the record you want as not all items from the series may be on RecordSearch.	10,450	25.76%
No items from the series are on RecordSearch. Please contact the Australian War Memorial if you need assistance.	1	0.00%

This brings down the 'undescribed' proportion. Strangely, though, the number of fully described series also drops, from 10,862 to 10,707, which seems to indicate that 155 fully described series have zero shelf metres.

In [130]:

df.loc[
    (df["described_note"].str.startswith("All")) & (df["quantity_total"] == 0)
].shape[0]

Out[130]:

For example:

In [131]:

df.loc[(df["described_note"].str.startswith("All")) & (df["quantity_total"] == 0)].head(
    2
)

Out[131]:

	identifier	title	contents_date_str	contents_start_date	contents_end_date	quantity_total	described_note	described_total	digitised_total	access_open_total	access_nye_total	access_owe_total	access_closed_total
121	A123	Name index cards (Departments), 'G' series	NaN	NaN	NaN	0.0	All items from this series are entered on Reco...	2	0	2	0	0	0
742	A749	Volume of Circulars of Public Service Commissi...	NaN	NaN	NaN	0.0	All items from this series are entered on Reco...	1	0	0	1	0	0

So perhaps in some cases locations and quantities are not reliably recorded on RecordSearch.
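One way to look at this more systematically is to cross-tabulate the type of description note against whether a quantity is recorded. The sketch below uses a small invented sample rather than the harvested data, and `holding_status` is a hypothetical helper. It also assumes that some quantities could be missing (`NaN`), which I haven't checked. That matters because a `!= 0` filter keeps rows with a missing quantity, while a `> 0` filter drops them.

```python
import numpy as np
import pandas as pd

# Invented sample rows; the notes are shortened versions of the real ones.
sample = pd.DataFrame(
    {
        "identifier": ["X1", "X2", "X3", "X4", "X5"],
        "quantity_total": [1.8, 0.0, np.nan, 0.0, 0.36],
        "described_note": [
            "All items from this series are entered on RecordSearch.",
            "All items from this series are entered on RecordSearch.",
            "No items from the series are on RecordSearch.",
            "No items from the series are on RecordSearch.",
            "Click to see items listed on RecordSearch.",
        ],
    }
)


def holding_status(quantity):
    """Label a quantity as 'missing', 'zero', or 'held'."""
    if pd.isna(quantity):
        return "missing"
    return "zero" if quantity == 0 else "held"


sample["holding_status"] = sample["quantity_total"].map(holding_status)

# The first word of the note ("All", "No", "Click") is enough to group by.
sample["note_type"] = sample["described_note"].str.split().str[0]

table = pd.crosstab(sample["note_type"], sample["holding_status"])
print(table)

# The two filters only disagree when a quantity is missing.
kept_not_zero = int((sample["quantity_total"] != 0).sum())
kept_positive = int((sample["quantity_total"] > 0).sum())
print(kept_not_zero, kept_positive)
```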

Series with no item descriptions¶

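From the items described note it seems that 19,411 series held by the NAA or AWM have no item-level descriptions (19,410 that refer you to the National Reference Service, plus 1 that refers you to the Australian War Memorial). We can check that by simply looking for series where the described_total value is zero.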

In [117]:

print(
    f'{df.loc[(df["quantity_total"] > 0) & (df["described_total"] == 0)].shape[0]:,} series held by NAA have no item descriptions'
)

19,411 series held by NAA have no item descriptions

Yay! That matches.

Boo! That's a pretty significant black hole. Let's look at the quantity of records that represents.

In [126]:

print(
    f'{df.loc[(df["quantity_total"] > 0) & (df["described_total"] == 0)]["quantity_total"].sum():,} linear metres in series held by NAA with no item descriptions'
)

51,930.87 linear metres in series held by NAA with no item descriptions

Of course, this doesn't include the quantities of series that are partially described.
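To get a feel for how the shelf metres are spread across the different levels of description, you could sum the quantities for each value of `described_note`. This is a sketch only, using invented quantities and shortened notes rather than the harvested data. Keep in mind that for partially described series there's no way of telling from these totals how many of the metres are actually described.

```python
import pandas as pd

# Invented sample rows; the notes are shortened versions of the real ones.
sample = pd.DataFrame(
    {
        "described_note": [
            "All items from this series are entered on RecordSearch.",
            "All items from this series are entered on RecordSearch.",
            "Click to see items listed on RecordSearch.",
            "No items from the series are on RecordSearch.",
            "No items from the series are on RecordSearch.",
        ],
        "quantity_total": [10.0, 5.0, 20.0, 15.0, 0.0],
    }
)

# Only include series where some quantity is held.
held = sample.loc[sample["quantity_total"] > 0]

# Total linear metres for each type of description note.
metres_by_note = (
    held.groupby("described_note")["quantity_total"].sum().to_frame("metres")
)
metres_by_note["percent"] = (
    metres_by_note["metres"] / metres_by_note["metres"].sum()
)

print(metres_by_note)
```

Swapping `sample` for `df` would run the same calculation on the harvested series.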

Created by Tim Sherratt for the GLAM Workbench.