Examining all files in RecordSearch with the access status of 'Closed'

(Harvested on 1 January 2021)

This notebook attempts some large-scale analysis of files from the National Archives of Australia's RecordSearch database that have the access status of 'closed'. For a previous attempt at this, see Closed Access. For more background, see my Inside Story article from 2018.

See this notebook for the code used to harvest the data and create the CSV dataset.

In [150]:
import pandas as pd
import altair as alt
import datetime

The harvested data has been saved as a CSV file. First we'll open it up using Pandas.

In [231]:
df2020 = pd.read_csv('data/closed-20210101.csv', parse_dates=['contents_start_date', 'contents_end_date', 'access_decision_date'], keep_default_na=False)
df2020.head()
Out[231]:
identifier series control_symbol title contents_date_str contents_start_date contents_end_date location access_status access_decision_date_str access_decision_date reasons
0 65454 A373 5117 [Home Office suspect index] 1946 - 1950 1946-01-01 1950-01-01 Canberra Closed 26 Nov 1980 1980-11-26 33(1)(b)
1 66910 A432 1929/394 ATTACHMENT 2 Abrahams. Opinions. 1927 - 1928 1927-01-01 1928-01-01 Canberra Closed 30 Jul 2018 2018-07-30 33(3)(a)(i)|33(3)(b)
2 66911 A432 1929/394 ATTACHMENT 3 Legal opinions expressed by O Dixon, E Gorman ... 1928 - 1928 1928-01-01 1928-01-01 Canberra Closed 30 Jul 2018 2018-07-30 33(3)(a)(i)|33(3)(b)
3 99746 A471 49941 [THOMAS Leslie Hector (Leading Aircraftman) : ... 1943 - 1943 1943-01-01 1943-01-01 Canberra Closed 20 May 1999 1999-05-20 33(1)(g)
4 103094 A518 FJ118/6 Nauru Census 1952 1952 - 1953 1952-01-01 1953-01-01 Canberra Closed 24 Oct 1989 1989-10-24 33(1)(d)|33(1)(g)
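
A note on `keep_default_na=False`: by default pandas converts empty fields to `NaN`, but with this option they stay as empty strings. That's why a blank reason can be matched later with `== ''`. Here's a minimal sketch using a made-up two-row CSV; the sample data and variable names are illustrative only and aren't part of the harvested dataset.

```python
import io
import pandas as pd

# A made-up sample, not taken from the harvested data
sample_csv = "identifier,reasons\n1,33(1)(g)\n2,\n"

default_df = pd.read_csv(io.StringIO(sample_csv))
keep_df = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)

# With the default settings the empty field becomes NaN
default_missing = int(default_df['reasons'].isna().sum())

# With keep_default_na=False it remains an empty string
kept_empty = int((keep_df['reasons'] == '').sum())

print(default_missing, kept_empty)
```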

How many closed files are there?

In [232]:
df2020.shape[0]
Out[232]:
11140

What series do the 'closed' files come from?

First let's see how many different series are represented in the data set.

In [233]:
df2020['series'].unique().shape[0]
Out[233]:
686

Now let's look at the 25 most common series.

In [234]:
df2020['series'].value_counts()[:25]
Out[234]:
K60        1671
A1838      1571
A13147      581
A6122       322
AWM54       309
A1209       293
A9737       232
B26         196
A1533       173
D4082       162
B73         158
A6135       154
E72         151
F1          126
AWM239      118
A432        114
PP946/1     112
A7452       103
A7324        85
C4384        76
C139         76
A3092        73
A2539        72
A1200        72
D1915        70
Name: series, dtype: int64

Series A1838 is familiar to anyone who's looked into the NAA's access examination process. It's a general correspondence series from DFAT, and requests for access tend to take a long time to be processed. Series K60 contains repatriation files from the Department of Veterans' Affairs, so these will often have been withheld on privacy grounds. We'll see more about both of these below.

Let's chart the results.

In [235]:
# This creates a compact dataset to feed to Altair for charting
# We could make Altair do all the work, but that would embed a lot of data in the notebook.

# Save the series counts to a new dataframe.
series_counts = df2020['series'].value_counts().to_frame().reset_index()
series_counts.columns = ['series', 'count']
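
As an aside, the same two-column table can be built in one step by naming the index and the counts as part of the chain. This is only a sketch on a toy list of series; `toy` and `toy_counts` are made-up names, and the approach used above works just as well.

```python
import pandas as pd

# A toy column of series identifiers (illustrative values only)
toy = pd.Series(['K60', 'K60', 'A1838'], name='series')

# Name the index 'series' and the values 'count' while resetting the index
toy_counts = toy.value_counts().rename_axis('series').reset_index(name='count')

print(toy_counts)
```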
In [236]:
# Chart the results, sorted by number of files
alt.Chart(series_counts[:50]).mark_bar().encode(
    x=alt.X('series', sort='-y'),
    y=alt.Y('count', title='number of files'),
    tooltip=['series', 'count']
)
Out[236]:

This is only the top 50 of 686 series, so quite obviously there's a very long tail of series that have a small number of closed files.
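
To put a rough number on that long tail, we can do some arithmetic with the top-25 counts listed above and the total of 11,140 files. The figures below are copied from the outputs earlier in this notebook rather than recalculated from the CSV.

```python
# Counts for the 25 most common series, copied from the value_counts() output above
top25_counts = [
    1671, 1571, 581, 322, 309, 293, 232, 196, 173, 162, 158, 154, 151,
    126, 118, 114, 112, 103, 85, 76, 76, 73, 72, 72, 70,
]
total_files = 11140   # from df2020.shape[0]
total_series = 686    # from the count of unique series

top2_share = sum(top25_counts[:2]) / total_files
top25_share = sum(top25_counts) / total_files
remaining_files = total_files - sum(top25_counts)
remaining_series = total_series - len(top25_counts)

print(f"Top 2 series: {top2_share:.1%} of closed files")
print(f"Top 25 series: {top25_share:.1%} of closed files")
print(f"Other {remaining_series} series: {remaining_files} files, "
      f"about {remaining_files / remaining_series:.1f} each")
```

On these figures, K60 and A1838 together hold about 29% of the closed files, and the top 25 series about 63%. The remaining 661 series share 4,070 files, or roughly six files each on average.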

What reasons are given for closing the files?

Section 33 of the Archives Act defines a number of 'exemptions' – these are reasons why files should not be opened to public access. These reasons are recorded in RecordSearch, so we can explore why files have been closed. It's a little complicated, however, because multiple exemptions can be applied to a single file. The CSV data file records multiple reasons as a pipe-separated string. First we can look at the most common combinations of reasons.

In [237]:
df2020['reasons'].value_counts()[:25]
Out[237]:
33(1)(g)                                      4573
Withheld pending adv                          3398
Parliament Class A                            1285
33(1)(a)                                       497
33(1)(a)|33(1)(b)                              260
Closed period                                  214
33(1)(d)|33(1)(g)                              149
33(1)(a)|33(1)(d)|33(1)(g)                     120
Non Cwlth-no appeal                             54
33(1)(a)|33(1)(b)|Withheld pending adv          53
33(1)(d)                                        50
                                                49
33(1)(a)|Withheld pending adv                   42
33(1)(a)|33(1)(d)|33(1)(e)(ii)|33(1)(g)         27
33(1)(e)(ii)                                    27
33(1)(a)|33(1)(d)|33(1)(e)(i)|33(1)(g)          25
33(1)(e)(ii)|33(1)(g)                           25
33(3)(a)(i)|33(3)(b)|33(3)(a)(ii)|33(3)(b)      24
33(1)(d)|33(1)(e)(iii)|33(1)(g)                 19
33(2)(a)|33(2)(b)                               15
33(1)(b)                                        15
NRF                                             14
33(3)(a)(i)|33(3)(b)                            13
Court records                                    9
33(1)(a)|33(1)(b)|33(1)(e)(iii)                  9
Name: reasons, dtype: int64

It's probably more useful, however, to look at the frequency of individual reasons. So we'll split the pipe-separated string and create a row for each file/reason combination.

In [238]:
df2020_reasons = df2020.copy()
# Split the reasons field on pipe symbol |. This turns the string into a list of values.
df2020_reasons['reason'] = df2020_reasons['reasons'].str.split('|')
In [239]:
# Now we'll explode the list into separate rows.
df2020_reasons = df2020_reasons.explode('reason')
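
To see what the split and explode steps do, here's a small sketch on a made-up three-row dataframe (the identifiers and the `toy` name are invented for illustration). Note that a file with an empty reasons string still produces one row, with an empty reason.

```python
import pandas as pd

# Three made-up files: two reasons, one reason, and no reason recorded
toy = pd.DataFrame({
    'identifier': [1, 2, 3],
    'reasons': ['33(1)(a)|33(1)(b)', '33(1)(g)', ''],
})

# Split the pipe-separated string into a list of values
toy['reason'] = toy['reasons'].str.split('|')

# Create one row per file/reason combination
exploded = toy.explode('reason')

print(exploded[['identifier', 'reason']])
```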

Now we can look at the frequency of individual reasons. Note, of course, that the sum of the reasons will be greater than the number of files, as some files have multiple exemptions applied to them.

In [240]:
df2020_reasons['reason'].value_counts()
Out[240]:
33(1)(g)                5003
Withheld pending adv    3524
Parliament Class A      1286
33(1)(a)                1096
33(1)(d)                 429
33(1)(b)                 362
Closed period            239
33(1)(e)(ii)             110
33(3)(b)                  66
Non Cwlth-no appeal       60
                          49
33(1)(e)(iii)             46
33(3)(a)(i)               38
33(1)(e)(i)               37
33(1)(j)                  30
33(3)(a)(ii)              28
33(2)(a)                  24
33(2)(b)                  24
NRF                       15
MAKE YOUR SELECTION       12
Non Cwlth-depositor       10
Court records              9
33(1)(f)(i)                7
33(1)(f)(ii)               5
33(1)(f)(iii)              4
33(1)(h)                   4
33(1)(c)                   3
Destroyed                  2
Name: reason, dtype: int64
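
As a quick check on the point about totals, we can add up the counts listed above and compare them with the number of files. The values below are copied from the output rather than recalculated from the data.

```python
# Individual reason counts, copied from the output above ('' is the blank entry)
listed_counts = {
    '33(1)(g)': 5003, 'Withheld pending adv': 3524, 'Parliament Class A': 1286,
    '33(1)(a)': 1096, '33(1)(d)': 429, '33(1)(b)': 362, 'Closed period': 239,
    '33(1)(e)(ii)': 110, '33(3)(b)': 66, 'Non Cwlth-no appeal': 60, '': 49,
    '33(1)(e)(iii)': 46, '33(3)(a)(i)': 38, '33(1)(e)(i)': 37, '33(1)(j)': 30,
    '33(3)(a)(ii)': 28, '33(2)(a)': 24, '33(2)(b)': 24, 'NRF': 15,
    'MAKE YOUR SELECTION': 12, 'Non Cwlth-depositor': 10, 'Court records': 9,
    '33(1)(f)(i)': 7, '33(1)(f)(ii)': 5, '33(1)(f)(iii)': 4, '33(1)(h)': 4,
    '33(1)(c)': 3, 'Destroyed': 2,
}
total_files = 11140  # from df2020.shape[0]

total_reason_rows = sum(listed_counts.values())
extra_rows = total_reason_rows - total_files

print(total_reason_rows, extra_rows)
```

That gives 12,522 file/reason rows for 11,140 files. The 1,382 additional rows come from files that have more than one reason listed.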

The reasons starting with '33' are clauses in section 33 of the Archives Act. You can look up the Act to find out more about them, or look at this list on the NAA website. Some of the reasons, such as 'Parliament Class A', refer to particular types of records that are not subject to the same public access arrangements as other government records. Others, such as 'MAKE YOUR SELECTION', seem to be products of the data entry system!

Looking at the other most common reasons:

  • 33(1)(g) relates to privacy
  • 'Withheld pending adv' is applied to files that are undergoing access examination and have been referred to the relevant government agency for advice on whether they should be released to the public. This is not a final determination – these files may or may not end up being closed. But, as any researcher knows, this process can be very slow.
  • 33(1)(a) is the national security catch-all
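
Since the section 33 exemptions all start with '33', it's easy to separate them from the other labels. Here's a minimal sketch; `is_section_33` is a hypothetical helper, and the sample list just reuses a few values from the output above.

```python
def is_section_33(reason):
    """Return True if a reason looks like a clause of section 33 (hypothetical helper)."""
    return reason.startswith('33(')

# A few values taken from the list of reasons above
sample = [
    '33(1)(g)', 'Withheld pending adv', 'Parliament Class A',
    '33(1)(a)', 'MAKE YOUR SELECTION', '',
]

exemptions = [r for r in sample if is_section_33(r)]
other_labels = [r for r in sample if not is_section_33(r)]

print(exemptions)
print(other_labels)
```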

You might also notice that there's a blank line in the list above. This is because some closed files have no reasons recorded in RecordSearch. We can check this.

In [241]:
missing_reasons = df2020.loc[df2020['reasons'] == '']
missing_reasons.shape[0]
Out[241]:
49

There are 49 closed files with no reason recorded. Here's a sample.

In [242]:
missing_reasons.head()
Out[242]:
identifier series control_symbol title contents_date_str contents_start_date contents_end_date location access_status access_decision_date_str access_decision_date reasons
612 546435 A1838 919/13/4 PART 31 France - Disarmament - Nuclear Weapons Testing 1972 - 1972 1972-01-01 1972-01-01 Canberra Closed 07 Oct 2016 2016-10-07
661 548392 A1838 3127/3/4 South Korea - Labour 1959 - 1968 1959-01-01 1968-01-01 Canberra Closed 11 Dec 2014 2014-12-11
862 567499 A1838 563/2/16 PART 9 Radio Australia - Technical - Foreign broadcas... 1961 - 1961 1961-01-01 1961-01-01 Canberra Closed 29 Aug 2012 2012-08-29
1351 733600 AWM239 178 [RAN Medical Officers' journals] PENGUIN (1 Ap... 1945 - 1945 1945-01-01 1945-01-01 Australian War Memorial Closed 14 Apr 2003 2003-04-14
1560 853102 AWM103 R478/1/147 [Headquarters, 1st Australian Task Force (HQ 1... 1970 - 1970 1970-01-01 1970-01-01 Australian War Memorial Closed 22 Jun 2009 2009-06-22

Let's change the missing reasons to 'None recorded' to make it easier to see what's going on.

In [243]:
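# Assign the result back to the column rather than relying on inplace=True,
# which doesn't reliably update the dataframe in recent versions of pandas
df2020_reasons['reason'] = df2020_reasons['reason'].replace('', 'None recorded')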

Let's chart the frequency of the different reasons.

In [244]:
# Once again we'll create a compact dataset for charting
reason_counts = df2020_reasons['reason'].value_counts().to_frame().reset_index()
reason_counts.columns = ['reason', 'count']
In [245]:
# Make the Chart
alt.Chart(reason_counts).mark_bar().encode(
    x='reason',
    y=alt.Y('count', title='number of files'),
    tooltip=['reason', 'count']
)
Out[245]:

Connecting reasons and series

It would be interesting to bring together the analyses above and see how reasons are distributed across series. First we need to reshape our dataset to show combinations of series and reasons.

In [246]:
# Group files by series and reason, then count the number of combinations
series_reasons_counts = df2020_reasons.groupby(by=['series', 'reason']).size().reset_index()
# Rename columns
series_reasons_counts.columns = ['series', 'reason', 'count']

Now we can chart the results. This time we'll highlight the reasons using color. Note that the chart ranks series/reason combinations rather than whole series, so it shows the largest combinations and the series they belong to.

In [250]:
alt.Chart(series_reasons_counts).transform_aggregate(
    count='sum(count)',
    groupby=['series', 'reason']
# Rank the series/reason combinations by number of files
).transform_window(
    rank='rank(count)',
    sort=[alt.SortField('count', order='descending')]
# Keep the 50 largest combinations (ranks start at 1)
).transform_filter(
    alt.datum.rank <= 50
).mark_bar().encode(
    x=alt.X('series', sort='-y'),
    y=alt.Y('sum(count)', title='number of files'),
    color='reason',
    tooltip=['series', 'reason', 'count']
)
Out[250]:
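
The window transform above ranks series/reason pairs, not whole series. If you want every reason for the N largest series instead, one option is to pick those series in pandas before charting. This is a sketch on made-up counts; `top_series_rows` is a hypothetical helper and the numbers are not from the dataset.

```python
import pandas as pd

# Made-up series/reason counts in the same shape as series_reasons_counts
toy_counts = pd.DataFrame({
    'series': ['A1838', 'A1838', 'K60', 'A6122', 'B26'],
    'reason': ['33(1)(a)', 'Withheld pending adv', '33(1)(g)', '33(1)(a)', '33(1)(g)'],
    'count': [5, 7, 10, 3, 1],
})

def top_series_rows(counts, n):
    """Keep all rows belonging to the n series with the most files."""
    totals = counts.groupby('series')['count'].sum()
    top = totals.nlargest(n).index
    return counts.loc[counts['series'].isin(top)]

top2 = top_series_rows(toy_counts, 2)
print(top2)
```

The filtered dataframe could then be passed straight to `alt.Chart()` without the aggregate, window, and filter transforms.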

Now we can see that the distribution of reasons varies considerably across series.

How old are these files?

You would think that the sensitivity of material in closed files diminishes over time. However, there's no automatic re-assessment or time limit on 'closed' files. They stay closed until someone asks for them to be re-examined. That means that some of these files can be quite old. How old? We can use the contents end date to explore this.

In [255]:
# Normalise contents end values as end of year
df2020['contents_end_year'] = df2020['contents_end_date'].apply(lambda x: datetime.datetime(x.year, 12, 31))
In [256]:
# Count the number of files for each end year
date_counts = df2020['contents_end_year'].value_counts().to_frame().reset_index()
date_counts.columns = ['end_date', 'count']
In [257]:
# Chart the number of closed files by contents end year
alt.Chart(date_counts).mark_bar().encode(
    x='year(end_date):T',
    y='count'
).properties(width=700)
Out[257]:
In [258]:
# The same chart, but only including files with end dates after 1890
alt.Chart(date_counts.loc[date_counts['end_date'] > '1890-12-31']).mark_bar().encode(
    x='year(end_date):T',
    y='count',
    tooltip='year(end_date)'
).properties(width=700)
Out[258]:
In [259]:
# Calculate the age of each file in years, relative to the date the notebook is run
df2020['years_old'] = df2020['contents_end_year'].apply(lambda x: round((datetime.datetime.now() - x).days / 365))
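
Because the calculation above uses `datetime.datetime.now()`, the ages will creep upwards each time the notebook is re-run. One option is to measure age against the harvest date instead. This is a sketch of that alternative, not what was used to produce the results in this notebook; `HARVEST_DATE` and `years_old_at` are hypothetical names.

```python
import datetime

# Assumption: ages are measured against the harvest date given at the top of the notebook
HARVEST_DATE = datetime.datetime(2021, 1, 1)

def years_old_at(end_of_year, reference=HARVEST_DATE):
    """Age in whole years of a file whose contents end on `end_of_year`."""
    return round((reference - end_of_year).days / 365)

print(years_old_at(datetime.datetime(2016, 12, 31)))  # 4
print(years_old_at(datetime.datetime(1950, 12, 31)))  # 70
```

For rows with a valid end date, this could be applied in the same way as the lambda above, for example with `df2020['contents_end_year'].apply(years_old_at)`.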
In [270]:
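# Count the number of files of each age
age_counts = df2020['years_old'].value_counts().to_frame().reset_index()
age_counts.columns = ['age', 'count']
# Chart the ages, leaving out any files more than 130 years old
alt.Chart(age_counts.loc[age_counts['age'] < 130]).mark_bar().encode(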
    x=alt.X('age:Q', title='age in years'),
    y=alt.Y('count', title='number of files'),
    tooltip=['age', 'count']
).properties(width=700)
Out[270]:
In [215]:
# Summary statistics for the ages of all closed files
df2020['years_old'].describe()
Out[215]:
count    11115.000000
mean        49.151327
std         19.387702
min          4.000000
25%         34.000000
50%         47.000000
75%         63.000000
max        220.000000
Name: years_old, dtype: float64
In [266]:
# Summary statistics for the ages of files closed under 33(1)(a)
df2020.loc[df2020['reasons'].str.contains('33(1)(a)', regex=False)]['years_old'].describe()
Out[266]:
count    1096.000000
mean       61.380474
std        12.805262
min        20.000000
25%        56.000000
50%        64.000000
75%        71.000000
max        94.000000
Name: years_old, dtype: float64
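
The `regex=False` in the filter above matters. By default `str.contains()` treats its pattern as a regular expression, and in a regular expression the parentheses in '33(1)(a)' are grouping symbols, not literal characters. Here's a small sketch of the difference using Python's `re` module directly; the variable names are illustrative.

```python
import re

reasons = '33(1)(a)|33(1)(b)'
code = '33(1)(a)'

# As a regular expression, the parentheses are capture groups,
# so this pattern actually looks for the text '331a'
regex_match = re.search(code, reasons) is not None

# Escaping the pattern, or using a plain substring test, matches the literal text
escaped_match = re.search(re.escape(code), reasons) is not None
literal_match = code in reasons

print(regex_match, escaped_match, literal_match)  # False True True
```

With the literal match, the filter finds 1,096 files, the same number reported for 33(1)(a) in the list of individual reasons.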
In [265]:
# Get the quartiles of the ages of all closed files as a list
df2020['years_old'].quantile([0.25, 0.5, 0.75]).to_list()
Out[265]:
[34.0, 47.0, 63.0]

Dates of decisions

In [117]:
# Extract the year of the access decision and count the files decided in each year
df2020['year'] = df2020['access_decision_date'].dt.year
year_counts = df2020['year'].value_counts().to_frame().reset_index()
year_counts.columns = ['year', 'count']
In [121]:
# Chart the number of access decisions by year
alt.Chart(year_counts).mark_bar().encode(
    x='year:O',
    y='count'
)
Out[121]:

33(1)(a)

In [105]:
# Select the files whose reasons include 33(1)(a)
df331a = df2020.loc[df2020['reasons'].str.contains('33(1)(a)', regex=False)]
df331a.head()
Out[105]:
identifier series control_symbol title contents_date_str contents_start_date contents_end_date location access_status access_decision_date_str access_decision_date reasons
5 140757 A1838 3034/2/2/2 PART 7 Indonesia. Communism in Indonesia 1960 - 1962 1960-01-01 1962-01-01 Canberra Closed 14 May 2012 2012-05-14 33(1)(a)|33(1)(b)|Withheld pending adv
7 170971 A816 41/301/195 Exchange of staff between Joint Intelligence B... 1953 - 1958 1953-01-01 1958-01-01 Canberra Closed 29 Apr 1991 1991-04-29 33(1)(a)|33(1)(b)
8 171089 A816 43/302/76 Cryptographic Material for ASIO 1951 - 1952 1951-01-01 1952-01-01 Canberra Closed 11 Mar 1993 1993-03-11 33(1)(a)|33(1)(b)
9 171129 A816 44/301/219 SEATO [South East Asia Treaty Organisation] Co... 1957 - 1957 1957-01-01 1957-01-01 Canberra Closed 01 Aug 1991 1991-08-01 33(1)(a)|33(1)(b)
12 200166 A1196 29/501/225 Evasion of Customs Regulations - RAAF Station,... 1944 - 1944 1944-01-01 1944-01-01 Canberra Closed 09 Apr 1975 1975-04-09 33(1)(a)|33(1)(b)
In [107]:
# Count the 33(1)(a) files in each series
series_counts_331a = df331a['series'].value_counts().to_frame().reset_index()
series_counts_331a.columns = ['series', 'count']
In [252]:
# Chart the 50 series with the most 33(1)(a) files
alt.Chart(series_counts_331a[:50]).mark_bar().encode(
    x=alt.X('series', sort='-y'),
    y=alt.Y('count', title='number of files'),
    tooltip=['series', 'count']
)
Out[252]:

Withheld pending advice

In [109]:
# Select the files whose reasons include 'Withheld pending adv'
dfwh = df2020.loc[df2020['reasons'].str.contains('Withheld pending adv', regex=False)]
dfwh.head()
Out[109]:
identifier series control_symbol title contents_date_str contents_start_date contents_end_date location access_status access_decision_date_str access_decision_date reasons
5 140757 A1838 3034/2/2/2 PART 7 Indonesia. Communism in Indonesia 1960 - 1962 1960-01-01 1962-01-01 Canberra Closed 14 May 2012 2012-05-14 33(1)(a)|33(1)(b)|Withheld pending adv
10 171205 A816 48/301/131 Inland tele-radio channels in Australia, Papua... 1950 - 1953 1950-01-01 1953-01-01 Canberra Closed 16 Aug 2018 2018-08-16 Withheld pending adv
11 199284 A1196 2/501/295 Provision of Capacity for the Manufacture of n... 1952 - 1955 1952-01-01 1955-01-01 Canberra Closed 04 Mar 2020 2020-03-04 Withheld pending adv
13 200647 A1196 36/501/729 PART 2 RAAF Component of the Strategic Reserve- Execu... 1958 - 1958 1958-01-01 1958-01-01 Canberra Closed 02 Nov 2016 2016-11-02 Withheld pending adv
14 200648 A1196 36/501/729 PART 3 RAAF Component Strategic Reserve. (Execution o... 1958 - 1959 1958-01-01 1959-01-01 Canberra Closed 02 Nov 2016 2016-11-02 Withheld pending adv
In [110]:
# Count the withheld files in each series
series_counts_wh = dfwh['series'].value_counts().to_frame().reset_index()
series_counts_wh.columns = ['series', 'count']
In [111]:
# Chart the 50 series with the most withheld files
alt.Chart(series_counts_wh[:50]).mark_bar().encode(
    x=alt.X('series', sort='-y'),
    y='count'
)
Out[111]: