The Trove API includes an option to retrieve details of digitised newspaper titles. Version 2 of the API added a separate option to get details of government gazettes. However, the original newspaper/titles request actually returns both the newspaper and gazette titles, so there's no way of getting just the newspaper titles. This notebook explains the problem and provides a simple workaround.
import requests
import pandas as pd
Add your Trove API key below.
api_key = 'YOUR API KEY GOES HERE'
print('Your API key is: {}'.format(api_key))
Getting a list of digitised newspapers or gazettes in Trove is easy. You just fire off a request to one of these endpoints:
https://api.trove.nla.gov.au/v2/newspaper/titles/
https://api.trove.nla.gov.au/v2/gazette/titles/
Let's create a function to get either the newspaper or gazette results.
def get_titles_df(title_type):
    # Set default params
    params = {
        'key': api_key,
        'encoding': 'json',
    }
    # Make the request to the titles endpoint and get the JSON data
    data = requests.get('https://api.trove.nla.gov.au/v2/{}/titles'.format(title_type), params=params).json()
    titles = []
    # Loop through the title records, saving the name and id
    for title in data['response']['records']['newspaper']:
        titles.append({'title': title['title'], 'id': int(title['id'])})
    # Convert to a dataframe
    df = pd.DataFrame(titles)
    return df
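To make the shape of the response easier to see, here is a minimal sketch that runs the same loop over a hand-written payload, without calling the API. The sample titles and ids are invented for illustration, and `extract_titles` is a hypothetical helper name. Only the nesting of `response`, `records` and `newspaper` is taken from the function above, and a real response will contain more fields than this.

```python
# Hypothetical sample payload: the titles and ids are invented for illustration.
# Only the nesting (response -> records -> newspaper) mirrors get_titles_df().
sample_data = {
    'response': {
        'records': {
            'newspaper': [
                {'id': '1', 'title': 'Example Town Herald'},
                {'id': '2', 'title': 'Example Government Gazette'},
            ]
        }
    }
}


def extract_titles(data):
    """Return a list of {'title', 'id'} dicts, converting each id to an integer."""
    return [
        {'title': record['title'], 'id': int(record['id'])}
        for record in data['response']['records']['newspaper']
    ]


sample_titles = extract_titles(sample_data)
print(sample_titles)
```

Converting the ids to integers up front means they can be compared directly when one list of titles is later checked against another.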
Let's use the function to get all the newspaper titles.
newspapers_df = get_titles_df('newspaper')
newspapers_df.head()
How many are there?
newspapers_df.shape
Everything looks ok, but if we search inside the results for titles that include the word 'Gazette', we find that the government gazettes are all included.
newspapers_df.loc[newspapers_df['title'].str.contains('Gazette')][:20]
We can't just filter the results on the word 'Gazette', as a number of newspapers also include the word in their titles. Instead, we'll get a list of the gazettes using the gazette/titles endpoint and subtract these titles from the list of newspapers.
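The difference between the two approaches can be sketched with a tiny made-up dataset. The titles and ids here are invented for illustration and are not real Trove records. The point is only that a keyword filter throws away a genuine newspaper, while matching on ids does not.

```python
import pandas as pd

# Hypothetical titles, invented for illustration only.
toy_newspapers = pd.DataFrame([
    {'title': 'Example Town Herald', 'id': 1},
    {'title': 'Example Valley Gazette', 'id': 2},      # a newspaper with 'Gazette' in its name
    {'title': 'Example Government Gazette', 'id': 3},  # a government gazette
])
toy_gazettes = pd.DataFrame([
    {'title': 'Example Government Gazette', 'id': 3},
])

# Keyword filter: removes the government gazette, but also a genuine newspaper
by_keyword = toy_newspapers[~toy_newspapers['title'].str.contains('Gazette')]

# Id-based subtraction: removes only the titles that appear in the gazettes list
by_id = toy_newspapers[~toy_newspapers['id'].isin(toy_gazettes['id'])]

print(by_keyword['title'].tolist())
print(by_id['title'].tolist())
```

In this toy example the keyword filter keeps one title, while the id-based subtraction keeps both newspapers.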
Let's get the gazettes.
gazettes_df = get_titles_df('gazette')
gazettes_df.head()
gazettes_df.shape
Now we'll create a new dataframe that only includes titles from newspapers_df if they are not in gazettes_df.
newspapers_not_gazettes_df = newspapers_df[~newspapers_df['id'].isin(gazettes_df['id'])]
newspapers_not_gazettes_df.shape
If it worked properly, the number of titles in the new dataframe should equal the number in the newspapers dataframe minus the number in the gazettes dataframe.
newspapers_not_gazettes_df.shape[0] == newspapers_df.shape[0] - gazettes_df.shape[0]
Yay!
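The count check above relies on two things: every gazette id also appears in the newspaper results, and neither list contains duplicate ids. If either failed, the subtraction could still be correct while the numbers didn't add up. Here is a minimal sketch of how those assumptions could be tested directly, using invented ids. `subtraction_is_clean` is a hypothetical helper, not part of the notebook's workflow.

```python
import pandas as pd


def subtraction_is_clean(all_df, subset_df):
    """True if every id in subset_df is in all_df and neither frame repeats an id."""
    return bool(
        subset_df['id'].isin(all_df['id']).all()
        and all_df['id'].is_unique
        and subset_df['id'].is_unique
    )


# Hypothetical ids, invented for illustration only.
toy_all = pd.DataFrame({'id': [1, 2, 3]})
toy_subset_ok = pd.DataFrame({'id': [3]})
toy_subset_extra = pd.DataFrame({'id': [3, 4]})  # id 4 is missing from toy_all

print(subtraction_is_clean(toy_all, toy_subset_ok))
print(subtraction_is_clean(toy_all, toy_subset_extra))
```

With the real data, the equivalent call would be `subtraction_is_clean(newspapers_df, gazettes_df)`.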
Created by Tim Sherratt for the GLAM Workbench.