Get a list of Trove newspapers that doesn't include government gazettes

The Trove API includes an option to retrieve details of digitised newspaper titles. Version 2 of the API added a separate option to get details of government gazettes. However, the original newspaper/titles request actually returns both the newspaper and gazette titles, so there's no direct way of getting just the newspaper titles. This notebook explains the problem and provides a simple workaround.

In [1]:
import requests
import pandas as pd

Add your Trove API key below.

In [ ]:
api_key = 'YOUR API KEY GOES HERE'
print('Your API key is: {}'.format(api_key))

The problem

Getting a list of digitised newspapers or gazettes in Trove is easy. You just fire off a request to one of these endpoints:

  • https://api.trove.nla.gov.au/v2/newspaper/titles/
  • https://api.trove.nla.gov.au/v2/gazette/titles/

Let's create a function to get either the newspaper or gazette results.

In [3]:
def get_titles_df(title_type):
    '''Get title details from the 'newspaper' or 'gazette' titles endpoint as a dataframe.'''
    # Set default params
    params = {
        'key': api_key,
        'encoding': 'json',
    }
    
    # Make the request to the titles endpoint, stopping early on an HTTP error (e.g. a bad API key)
    response = requests.get('https://api.trove.nla.gov.au/v2/{}/titles'.format(title_type), params=params)
    response.raise_for_status()
    data = response.json()
    titles = []
    
    # Loop through the title records, saving the name and id
    # (the records sit under the 'newspaper' key for both newspapers and gazettes)
    for title in data['response']['records']['newspaper']:
        titles.append({'title': title['title'], 'id': int(title['id'])})
        
    # Convert to a dataframe
    df = pd.DataFrame(titles)
    return df
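
To see what the loop in get_titles_df() is doing without spending an API call, the same parsing can be tried on a hand-made dictionary. This is only a sketch: sample_data is a made-up stand-in that mimics the nesting the function reads (response, then records, then newspaper), the ids are assumed to arrive as strings, and real records may well contain other fields not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON response.
# Only the keys that get_titles_df() reads are included; ids are assumed to be strings.
sample_data = {
    'response': {
        'records': {
            'newspaper': [
                {'id': '166', 'title': 'Canberra Community News (ACT : 1925 - 1927)'},
                {'id': '871', 'title': 'Good Neighbour (ACT : 1950 - 1969)'},
            ]
        }
    }
}

# Same logic as the loop in get_titles_df(): keep the title, convert the id to an integer
sample_titles = [
    {'title': record['title'], 'id': int(record['id'])}
    for record in sample_data['response']['records']['newspaper']
]

sample_df = pd.DataFrame(sample_titles)
print(sample_df)
```

Converting the ids to integers here matters later on, because the gazettes are removed by comparing id values between two dataframes.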

Let's use the function to get all the newspaper titles.

In [4]:
newspapers_df = get_titles_df('newspaper')
newspapers_df.head()
Out[4]:
title id
0 Canberra Community News (ACT : 1925 - 1927) 166
1 Canberra Illustrated: A Quarterly Magazine (AC... 165
2 Federal Capital Pioneer (Canberra, ACT : 1924 ... 69
3 Good Neighbour (ACT : 1950 - 1969) 871
4 Student Notes/Canberra University College Stud... 665

How many are there?

In [5]:
newspapers_df.shape
Out[5]:
(1567, 2)

Everything looks OK, but if we search the results for titles that include the word 'Gazette', we find that the government gazettes are included too. Here are the first 20 matches.

In [6]:
newspapers_df.loc[newspapers_df['title'].str.contains('Gazette')][:20]
Out[6]:
title id
12 Papua New Guinea Government Gazette (1971 - 1975) 1372
17 Territory of Papua and New Guinea Government G... 1371
18 Territory of Papua Government Gazette (Papua N... 1369
19 Territory of Papua-New Guinea Government Gazet... 1370
21 Australian Government Gazette (National : 1973... 1288
22 Australian Government Gazette. Chemical (Natio... 1355
23 Australian Government Gazette. General (Nation... 1289
24 Australian Government Gazette. Periodic (Natio... 1294
25 Australian Government Gazette. Public Service ... 1308
26 Australian Government Gazette. Special (Nation... 1286
27 Commonwealth of Australia Gazette (National : ... 1214
28 Commonwealth of Australia Gazette. Agricultura... 1363
29 Commonwealth of Australia Gazette. Australian ... 1358
30 Commonwealth of Australia Gazette. Australian ... 1360
31 Commonwealth of Australia Gazette. Australian ... 1361
32 Commonwealth of Australia Gazette. Australian ... 1351
33 Commonwealth of Australia Gazette. Australian ... 1350
34 Commonwealth of Australia Gazette. Australian ... 1356
35 Commonwealth of Australia Gazette. Australian ... 1357
36 Commonwealth of Australia Gazette. Business (N... 1343

The solution

We can't just filter the results on the word 'Gazette' as a number of newspapers also include the word in their titles. Instead, we'll get a list of the gazettes using the gazette/titles endpoint and subtract these titles from the list of newspapers.
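
A small sketch of why the title filter falls short. The first two rows below are taken from the outputs above; the third is a hypothetical newspaper, invented for this example, whose name happens to include 'Gazette'.

```python
import pandas as pd

# Made-up sample: two titles from the outputs above, plus one hypothetical
# newspaper (invented title and id) with 'Gazette' in its name
demo_df = pd.DataFrame([
    {'title': 'Canberra Community News (ACT : 1925 - 1927)', 'id': 166},
    {'title': 'Papua New Guinea Government Gazette (1971 - 1975)', 'id': 1372},
    {'title': 'Example Town Gazette (hypothetical newspaper)', 'id': 9999},
])

# Naive approach: drop every title containing the word 'Gazette'
naive_df = demo_df[~demo_df['title'].str.contains('Gazette')]

# The hypothetical newspaper is discarded along with the government gazette
print(naive_df['id'].tolist())
```

Matching on ids from the gazette/titles endpoint avoids this, because it doesn't depend on how a title is worded.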

Let's get the gazettes.

In [7]:
gazettes_df = get_titles_df('gazette')
gazettes_df.head()
Out[7]:
title id
0 Papua New Guinea Government Gazette (1971 - 1975) 1372
1 Territory of Papua and New Guinea Government G... 1371
2 Territory of Papua Government Gazette (Papua N... 1369
3 Territory of Papua-New Guinea Government Gazet... 1370
4 Australian Government Gazette (National : 1973... 1288
In [8]:
gazettes_df.shape
Out[8]:
(37, 2)

Now we'll create a new dataframe that only includes titles from newspapers_df if their ids are not in gazettes_df.

In [9]:
newspapers_not_gazettes_df = newspapers_df[~newspapers_df['id'].isin(gazettes_df['id'])]
In [10]:
newspapers_not_gazettes_df.shape
Out[10]:
(1530, 2)

If it worked properly, the number of titles in the new dataframe should equal the number in the newspapers dataframe minus the number in the gazettes dataframe.

In [11]:
newspapers_not_gazettes_df.shape[0] == newspapers_df.shape[0] - gazettes_df.shape[0]
Out[11]:
True

Yay!
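
The count comparison only comes out true when every gazette id also appears in the newspaper list. A slightly stricter check, sketched here with made-up ids rather than live API data, is to confirm that no gazette id survives the filter and that the ids left over are exactly the non-gazette ones. The helper name exclude_ids is invented for this example.

```python
import pandas as pd

def exclude_ids(titles_df, unwanted_df):
    '''Return the rows of titles_df whose 'id' is not in unwanted_df (hypothetical helper).'''
    return titles_df[~titles_df['id'].isin(unwanted_df['id'])]

# Made-up titles and ids standing in for the real dataframes
demo_newspapers = pd.DataFrame({'title': ['A', 'B', 'C', 'D'], 'id': [1, 2, 3, 4]})
demo_gazettes = pd.DataFrame({'title': ['B', 'D'], 'id': [2, 4]})

demo_result = exclude_ids(demo_newspapers, demo_gazettes)

# Is any gazette id still present in the result?
print(demo_result['id'].isin(demo_gazettes['id']).any())

# Do the remaining ids match the set difference of the two id lists?
print(set(demo_result['id']) == set(demo_newspapers['id']) - set(demo_gazettes['id']))
```

The same two checks could be run against newspapers_not_gazettes_df and gazettes_df once the real data has been downloaded.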


Created by Tim Sherratt for the GLAM Workbench.