Get a list of Trove newspapers that doesn't include government gazettes

The Trove API includes an option to retrieve details of digitised newspaper titles. Version 2 of the API added a separate option to get details of government gazettes. However, the original newspaper/titles request still returns both the newspaper and gazette titles, so there's no direct way of getting just the newspaper titles. This notebook explains the problem and provides a simple workaround.

In [1]:

import os

import pandas as pd
import requests

In [2]:

%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv

Add your Trove API key below.

In [3]:

# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

The problem

Getting a list of digitised newspapers or gazettes in Trove is easy. You just fire off a request to one of these endpoints:

https://api.trove.nla.gov.au/v2/newspaper/titles/
https://api.trove.nla.gov.au/v2/gazette/titles/

Let's create a function to get either the newspaper or gazette results.

In [4]:

def get_titles_df(title_type):
    """Get titles from the 'newspaper' or 'gazette' endpoint as a dataframe."""
    # Set default params
    params = {
        "key": API_KEY,
        "encoding": "json",
    }

    # Make the request to the titles endpoint
    response = requests.get(
        f"https://api.trove.nla.gov.au/v2/{title_type}/titles", params=params
    )
    # Stop here if the request failed, rather than trying to parse an error page
    response.raise_for_status()
    data = response.json()

    # Loop through the title records, saving the name and id
    # (this function reads both endpoints' records from the "newspaper" key)
    titles = [
        {"title": title["title"], "id": int(title["id"])}
        for title in data["response"]["records"]["newspaper"]
    ]

    # Convert to a dataframe
    return pd.DataFrame(titles)
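
The parsing step in `get_titles_df` can also be tried out without calling the API. The sketch below assumes the response shape that the function relies on (`response` → `records` → `newspaper`, with ids supplied as strings). `parse_titles` is a hypothetical helper, not part of the notebook, and the sample payload is hand-built using two titles from this notebook's results.

```python
def parse_titles(data):
    """Extract title names and integer ids from a decoded titles response.

    Assumes data["response"]["records"]["newspaper"] is a list of records,
    each with "title" and "id" keys. This is a hypothetical helper.
    """
    # Fall back to an empty list if any level of the assumed structure is missing
    records = data.get("response", {}).get("records", {}).get("newspaper", [])
    return [{"title": r["title"], "id": int(r["id"])} for r in records]


# Hand-built payload in the assumed shape (not a real API response)
sample = {
    "response": {
        "records": {
            "newspaper": [
                {"id": "166", "title": "Canberra Community News (ACT : 1925 - 1927)"},
                {"id": "871", "title": "Good Neighbour (ACT : 1950 - 1969)"},
            ]
        }
    }
}

titles = parse_titles(sample)
print(titles)
```

Separating the parsing from the request makes it possible to check the id conversion on a small payload before spending an API call.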

Let's use the function to get all the newspaper titles.

In [5]:

newspapers_df = get_titles_df("newspaper")
newspapers_df.head()

Out[5]:

	title	id
0	Canberra Community News (ACT : 1925 - 1927)	166
1	Canberra Illustrated: A Quarterly Magazine (AC...	165
2	Federal Capital Pioneer (Canberra, ACT : 1924 ...	69
3	Good Neighbour (ACT : 1950 - 1969)	871
4	Student Notes/Canberra University College Stud...	665

How many are there?

In [6]:

newspapers_df.shape

Out[6]:

(1717, 2)

Everything looks ok, but if we search inside the results for titles that include the word 'Gazette' we find that the government gazettes are all included.

In [7]:

newspapers_df.loc[newspapers_df["title"].str.contains("Gazette")][:20]

Out[7]:

	title	id
13	Papua New Guinea Government Gazette (1971 - 1975)	1372
21	Territory of Papua and New Guinea Government G...	1371
22	Territory of Papua Government Gazette (Papua N...	1369
23	Territory of Papua-New Guinea Government Gazet...	1370
28	Australian Government Gazette (National : 1973...	1288
29	Australian Government Gazette. Chemical (Natio...	1355
30	Australian Government Gazette. General (Nation...	1289
31	Australian Government Gazette. Periodic (Natio...	1294
32	Australian Government Gazette. Public Service ...	1308
33	Australian Government Gazette. Special (Nation...	1286
34	Commonwealth of Australia Gazette (National : ...	1214
35	Commonwealth of Australia Gazette. Agricultura...	1363
36	Commonwealth of Australia Gazette. Australian ...	1358
37	Commonwealth of Australia Gazette. Australian ...	1360
38	Commonwealth of Australia Gazette. Australian ...	1361
39	Commonwealth of Australia Gazette. Australian ...	1351
40	Commonwealth of Australia Gazette. Australian ...	1350
41	Commonwealth of Australia Gazette. Australian ...	1356
42	Commonwealth of Australia Gazette. Australian ...	1357
43	Commonwealth of Australia Gazette. Business (N...	1343

The solution

We can't just filter the results on the word 'Gazette' as a number of newspapers also include the word in their titles. Instead, we'll get a list of the gazettes using the gazette/titles endpoint and subtract these titles from the list of newspapers.
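
To see the difference between the two approaches, here is a minimal sketch on toy data. The first two rows reuse titles and ids from this notebook's results. The third title and its id are invented purely for illustration and do not come from Trove.

```python
import pandas as pd

# Toy data: rows one and two reuse real titles from this notebook;
# "Example Town Gazette and Advertiser" (id 9001) is invented.
newspapers = pd.DataFrame(
    [
        {"title": "Canberra Community News (ACT : 1925 - 1927)", "id": 166},
        {"title": "Papua New Guinea Government Gazette (1971 - 1975)", "id": 1372},
        {"title": "Example Town Gazette and Advertiser", "id": 9001},
    ]
)
gazettes = pd.DataFrame(
    [{"title": "Papua New Guinea Government Gazette (1971 - 1975)", "id": 1372}]
)

# Keyword filter: also drops the invented newspaper with 'Gazette' in its title
by_keyword = newspapers[~newspapers["title"].str.contains("Gazette")]

# Id subtraction: drops only rows whose id appears in the gazettes list
by_id = newspapers[~newspapers["id"].isin(gazettes["id"])]

print(by_keyword["id"].tolist())  # [166]
print(by_id["id"].tolist())  # [166, 9001]
```

Matching on `id` rather than `title` also avoids problems if the same title string is formatted slightly differently by the two endpoints.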

Let's get the gazettes.

In [8]:

gazettes_df = get_titles_df("gazette")
gazettes_df.head()

Out[8]:

	title	id
0	Administration Order (Nauru : 1921 - 1926)	1571
1	Papua New Guinea Government Gazette (1971 - 1975)	1372
2	Territory of Papua and New Guinea Government G...	1371
3	Territory of Papua Government Gazette (Papua N...	1369
4	Territory of Papua-New Guinea Government Gazet...	1370

In [9]:

gazettes_df.shape

Out[9]:

(38, 2)

Now we'll create a new dataframe that only includes titles from newspapers_df if their ids are not in gazettes_df.

In [10]:

newspapers_not_gazettes_df = newspapers_df[~newspapers_df["id"].isin(gazettes_df["id"])]

In [11]:

newspapers_not_gazettes_df.shape

Out[11]:

(1679, 2)

If it worked properly, the number of titles in the new dataframe should equal the number in the newspapers dataframe minus the number in the gazettes dataframe.

In [12]:

newspapers_not_gazettes_df.shape[0] == newspapers_df.shape[0] - gazettes_df.shape[0]

Out[12]:

True

Yay!
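
The count comparison above only holds when every gazette id also appears in the newspapers list. A slightly stricter check compares the ids themselves. The sketch below is one possible way to do that with plain Python sets. `check_subtraction` is a hypothetical helper, and the toy ids are chosen so that one gazette id (1571) is deliberately missing from the newspaper ids.

```python
def check_subtraction(newspaper_ids, gazette_ids, remaining_ids):
    """Return True if remaining_ids is exactly newspaper_ids minus gazette_ids.

    Ids are compared as sets, so duplicate ids are ignored.
    """
    newspapers = set(newspaper_ids)
    gazettes = set(gazette_ids)
    remaining = set(remaining_ids)
    return remaining == newspapers - gazettes


# Toy ids: 1571 is in the gazette list but not in the newspaper list,
# a case where the simple count comparison would report a mismatch.
ok = check_subtraction([166, 1372, 1371], [1372, 1371, 1571], [166])
leftover = check_subtraction([166, 1372, 1371], [1372, 1371], [166, 1372])
print(ok, leftover)  # True False
```

With the dataframes from this notebook, the call would be `check_subtraction(newspapers_df["id"], gazettes_df["id"], newspapers_not_gazettes_df["id"])`.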

Created by Tim Sherratt for the GLAM Workbench.
Support this project by becoming a GitHub sponsor.