Harvest ABC Radio National records from Trove

Trove harvests details of programs and segments broadcast on ABC Radio National. You can find them by searching for nuc:"ABC:RN" in the Music & Audio category. The records include basic metadata such as titles, dates, and contributors, but not full transcripts or audio.

I first harvested this metadata back in 2016, but hadn't updated it since because there seemed to be a lot of duplicate records. I finally decided to just go ahead with the harvest and deal with the duplicates afterwards using Pandas.

There are also quite a lot of inconsistencies in the way the data is formatted – some fields can contain a variety of arrays, strings, and objects. I've tried to standardise these as much as possible. The harvesting and cleaning is all documented below.

The harvested data is available for download from CloudStor as a 560MB JSONL file (one JSON object per record, separated by line breaks) and as a 324MB CSV file (with lists saved as pipe-separated strings).
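If you're working with the CSV version, the pipe-separated strings can be split back into Python lists after loading. Here's a minimal sketch using a hypothetical two-row sample in place of the real download:

```python
import io
import pandas as pd

# A tiny stand-in for the downloaded CSV (hypothetical values) --
# list fields like contributor are stored as pipe-separated strings.
csv_data = io.StringIO(
    "title,contributor\n"
    "Example segment,Jane Host|Joe Guest\n"
)
df = pd.read_csv(csv_data)

# Split the pipe-delimited strings back into Python lists.
df["contributor"] = df["contributor"].str.split("|")
```

The same `.str.split('|')` call works on any of the list-valued columns, such as subject or type.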

There are 408,082 records from about 164 programs (the actual number of programs is smaller than this, as the names used for some programs vary). See this notebook for some examples of how you can start exploring the data.

For convenience, I've also created separate CSV files for the programs with the most records; you can download them from CloudStor:

Data fields

Any of the fields other than work_id and version_id might be empty, though in most cases there should at least be values for title, date, creator, contributor and isPartOf.

  • work_id – identifier for the containing work in Trove (you can use this to create a url to the item)
  • version_id – an identifier for the version within the work
  • title – title for the program or segment
  • isPartOf – name of the program this is a part of
  • date – ISO formatted date
  • creator – usually just the ABC
  • contributor – a list of names of those involved, such as the host, reporter or guest
  • publisher – usually just the ABC
  • rights – copyright information
  • type – list of types (not sure how this differs from format)
  • format – list of formats (not sure how this differs from type)
  • abstract – text providing a summary of the program or segment (may include multiple values)
  • subject – list of subject tags (uncontrolled and very messy)
  • description – truncated text fragment from the start of the transcript (may include multiple values)
  • fulltext_url – link to the page on the ABC website where you can find more information
  • thumbnail_url – link to a related thumbnail image on the ABC website
  • notonline_url – not sure...
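As noted above, work_id can be used to construct a link to the item in Trove. A quick sketch (the URL pattern matches current Trove work pages, but check it against a live record):

```python
def work_url(work_id):
    # Trove work pages follow the pattern
    # https://trove.nla.gov.au/work/<work_id>
    return f"https://trove.nla.gov.au/work/{work_id}"
```

For example, `work_url('192773792')` returns `https://trove.nla.gov.au/work/192773792`.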

Import what we need

In [4]:
import requests
import requests_cache
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
import pandas as pd
from pathlib import Path
import jsonlines
import time
import datetime

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[ 502, 503, 504 ])
s.mount('https://', HTTPAdapter(max_retries=retries))
s.mount('http://', HTTPAdapter(max_retries=retries))
In [5]:
TROVE_API_KEY = '[YOUR API KEY]'

Define some functions

In [6]:
def get_total(params):
    params['n'] = 0
    response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
    data = response.json()
    return int(data['response']['zone'][0]['records']['total'])

def get_metadata_source(record):
    try:
        source = record['metadataSource']['value']
    except TypeError:
        source = record['metadataSource']
    return source

def make_list(value):
    '''
    Some fields are mixed lists and strings, use this to turn them all into lists.
    '''
    if isinstance(value, list):
        return value
    else:
        return [value]
    
def unmake_list(value, sep='|'):
    '''
    If a value is a list, join the values into a string.
    '''
    if isinstance(value, list):
        return sep.join(value)
    else:
        return value
    
def extract_values(value):
    '''
    Some fields mix dicts and lists. Try to extract values from dicts and return only lists.
    '''
    values = []
    value_list = make_list(value)
    for v in value_list:
        try:
            values.append(v['value'])
        except (TypeError, KeyError):
            values.append(v)
    return values

def normalise_date(value):
    '''
    Dates can be strings, dicts, or lists.
    Try to return a single ISO date string.
    '''
    if isinstance(value, list):
        try:
            date = value[0]['value']
        except(KeyError, TypeError):
            date = sorted(value, key=len)[0]
    elif isinstance(value, dict):
        date = value['value']
    else:
        date = value
    return date[:10]

def get_links(identifiers):
    '''
    Flatten the identifiers list of dicts into a dict with linktype as key.
    '''
    links = {}
    for link in identifiers:
        try:
            links[f'{link["linktype"]}_url'] = link['value']
        except (TypeError, KeyError):
            pass
    return links    

def harvest():
    params = {
        'q': 'nuc:"ABC:RN"',
        'zone': 'music',
        'include': 'workversions',
        'n':  100,
        'bulkHarvest': 'true',
        'encoding': 'json',
        'key': TROVE_API_KEY
    }
    start = '*'
    total = get_total(params.copy())
    Path('data').mkdir(exist_ok=True)
    with jsonlines.open(Path('data', 'abcrn.jsonl'), mode='w') as writer:
        with tqdm(total=total) as pbar:
            while start:
                params['s'] = start
                response = s.get('https://api.trove.nla.gov.au/v2/result', params=params)
                data = response.json()
                # Loop through the work records
                records = data['response']['zone'][0]['records']['work']
                for record in records:
                    # Now loop through the version records
                    for version in record['version']:
                        # Sometimes versions can themselves contain multiple records and ids
                        # First we'll try splitting the ids in case there are multiple values
                        ids = version['id'].split()
                        # Then we'll try looping through any sub-version records
                        for i, subv in enumerate(make_list(version['record'])):
                            # Get the metadata source so we can filter out any records we don't want
                            source = get_metadata_source(subv)
                            if source == 'ABC:RN':
                                # Add work id to the record
                                subv['work_id'] = record['id']
                                # Add version id to the record
                                subv['version_id'] = ids[i]
                                # Remove space around title
                                if 'title' in subv:
                                    subv['title'] = str(subv['title']).strip()
                                # Dates can be strings, dicts, or lists - normalise them!
                                if 'date' in subv:
                                    subv['date'] = normalise_date(subv['date'])
                                # Make sure these are just strings
                                if 'isPartOf' in subv:
                                    subv['isPartOf'] = unmake_list(subv['isPartOf'])
                                # Try to standardise the formats of these fields -- turn them all into simple lists
                                if 'creator' in subv:
                                    subv['creator'] = extract_values(subv['creator'])
                                if 'contributor' in subv:
                                    subv['contributor'] = extract_values(subv['contributor'])
                                if 'description' in subv:
                                    subv['description'] = extract_values(subv['description'])
                                # Make sure these are all lists
                                if 'type' in subv:
                                    subv['type'] = make_list(subv['type'])
                                if 'format' in subv:
                                    subv['format'] = make_list(subv['format'])
                                # Get links by flattening the identifiers field and add to record
                                links = get_links(subv['identifier'])
                                subv.update(links)
                                # remove unnecessary identifiers field
                                del subv['identifier']
                                writer.write(subv)
                try:
                    start = data['response']['zone'][0]['records']['nextStart']
                except KeyError:
                    start = None
                pbar.update(len(records)) 
                if not response.from_cache:
                    time.sleep(0.2)

Harvest the data!

In [ ]:
harvest()

Remove duplicate records

How many records have we harvested? Let's load the jsonl file into a dataframe and explore.

In [5]:
# The lines param tells pandas there's one JSON object per line.
df = pd.read_json(Path('data', 'abcrn.jsonl'), lines=True)
df.head()
Out[5]:
abstract isPartOf publisher rights type title creator date format language ... fulltext_url thumbnail_url rightsHolder header notonline_url spatial tableOfContents modified publisher.CorporateName coverage.postcode
0 [What politicians believe is good for women's ... ABC Radio National. Health Report Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Sound, Transcript, Radio Broadcast] RU 486 [Australian Broadcasting Corporation. Radio Na... 1997-09-22 [text/html, Transcript] [en-AU] ... http://www.abc.net.au/radionational/programs/h... http://www.abc.net.au/radionational/image/3699... NaN NaN NaN NaN NaN NaN NaN NaN
1 [There's an on-going courtroom war between cop... ABC Radio National. Law Report Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Sound, Transcript, Radio Broadcast] Copyright and the courts [Australian Broadcasting Corporation. Radio Na... 2011-05-12 [Audio, Transcript] [en-AU] ... http://www.abc.net.au/radionational/programs/l... http://www.abc.net.au/radionational/image/3699... NaN NaN NaN NaN NaN NaN NaN NaN
2 [Disability rights lawyer and endurance athlet... ABC Radio National. RN Breakfast Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Sound, Transcript, Radio Broadcast] The Law Report [Australian Broadcasting Corporation. Radio Na... 2014-03-25 [text/html] [en-AU] ... http://www.abc.net.au/radionational/programs/b... http://www.abc.net.au/radionational/image/3699... ABC NaN NaN NaN NaN NaN NaN NaN
3 [Professor Andrew Ashworth, one of the United ... ABC Radio National. RN Breakfast Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Sound, Transcript, Radio Broadcast] The Law Report [Australian Broadcasting Corporation. Radio Na... 2014-02-11 [text/html] [en-AU] ... http://www.abc.net.au/radionational/programs/b... http://www.abc.net.au/radionational/image/3699... ABC NaN NaN NaN NaN NaN NaN NaN
4 [What has happened in East Timor since indepen... ABC Radio National. Rear Vision Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Text, Transcript, Radio Broadcast] East Timor Since Independence [Australian Broadcasting Corporation. Radio Na... 2006-06-29 [Audio] [en-AU] ... http://www.abc.net.au/radionational/programs/r... http://www.abc.net.au/radionational/image/3699... NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 26 columns

In [6]:
df.shape
Out[6]:
(470817, 26)

However, there are quite a lot of duplicates in the data. You'd expect the combination of title, date, and program (in the isPartOf field) to be unique – let's see.

In [7]:
df.loc[df.duplicated(subset=('title', 'date', 'isPartOf'))].shape
Out[7]:
(62735, 26)

Let's remove the duplicates based on the title, date, and isPartOf fields. Adding fulltext_url to the sort pushes the rows without urls to the end (pandas sorts NaN values last), so drop_duplicates, which by default keeps the first occurrence of each duplicated record, retains the versions with urls.

In [8]:
df = df.sort_values(by=['title', 'date', 'fulltext_url']).drop_duplicates(subset=['title', 'date', 'isPartOf'])
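To see why this works, here's a toy example (with made-up values) showing that sorting pushes the NaN url to the end, so drop_duplicates keeps the row that has a url:

```python
import numpy as np
import pandas as pd

# Two versions of the same (hypothetical) segment -- one with a
# fulltext_url and one without.
demo = pd.DataFrame({
    "title": ["Example", "Example"],
    "date": ["2016-01-01", "2016-01-01"],
    "isPartOf": ["Program", "Program"],
    "fulltext_url": [np.nan, "http://example.com/transcript"],
})

# sort_values puts NaN last, and drop_duplicates keeps the first row,
# so the version with a url survives.
deduped = demo.sort_values(
    by=["title", "date", "fulltext_url"]
).drop_duplicates(subset=["title", "date", "isPartOf"])
```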

Now how many do we have?

In [9]:
df.shape
Out[9]:
(408082, 26)

Clean and reshape

Here I'll drop some unnecessary columns and reorder those that remain.

In [10]:
df = df[['work_id', 'version_id', 'title', 'isPartOf', 'date', 'creator', 'contributor', 'publisher', 'rights', 'type', 'format', 'abstract', 'subject', 'description', 'fulltext_url', 'thumbnail_url', 'notonline_url']]
In [11]:
df.head()
Out[11]:
work_id version_id title isPartOf date creator contributor publisher rights type format abstract subject description fulltext_url thumbnail_url notonline_url
249156 192773792 211006911 ABC Radio. AM Archive 2000-03-01 [Australian Broadcasting Corporation. News] [Denise Knight] Australian Broadcasting Corporation NaN [Radio Broadcast] [Transcript] [A key US cabinet minister Energy Secretary Bi... NaN [COMPERE: A key US cabinet minister, Energy Se... http://www.abc.net.au/am/stories/s104577.htm http://www.abc.net.au/am/rss/AM-100x100.jpg NaN
249709 192780593 211016874 ABC Radio. The World Today 2001-03-15 [Australian Broadcasting Corporation. News] NaN Australian Broadcasting Corporation NaN [Radio Broadcast] NaN [China has delivered its toughest warning yet ... NaN NaN http://www.abc.net.au/worldtoday/stories/s2608... http://www.abc.net.au/worldtoday/rss/TWT-1400x... NaN
30897 188214807 204817860 ABC Radio. AM 2009-09-14 [Australian Broadcasting Corporation. News] [John Shovelan, Tony Eastley] Australian Broadcasting Corporation NaN [Radio Broadcast] [Audio, Transcript, Transcript] [The White House has dismissed suggestions tha... [English Defence League, Barack Obama, Joe Wil... [TONY EASTLEY: The White House has dismissed s... http://www.abc.net.au/am/content/2009/s2684727... http://www.abc.net.au/am/rss/AM-100x100.jpg NaN
373461 242731122 270536229 " Bringing the Spirits Home: the first stolen ... ABC Radio National. AWAYE! 2003-08-07 NaN [Rhoda Roberts, Lorena Allam] Australian Broadcasting Corporation https://www.abc.net.au/conditions.htm#UseOfCon... [Transcript, Sound] [text/html] ["Bringing the Spirits Home: the first stolen ... NaN NaN https://www.abc.net.au/radionational/programs/... https://www.abc.net.au/cm/rimage/8135856-1x1-l... NaN
28043 188211972 204814278 " Bringing the Spirits Home: the first stolen ... ABC Radio National. AWAYE! 2004-06-04 [Australian Broadcasting Corporation. Radio Na... [Rhoda Roberts, Lorena Allam] Australian Broadcasting Corporation http://www.abc.net.au/conditions.htm#UseOfContent [Sound, Transcript, Radio Broadcast] [text/html] [Bringing the Spirits Home: the first stolen g... NaN ["Bringing the Spirits Home: the first stolen ... http://www.abc.net.au/radionational/programs/a... http://www.abc.net.au/radionational/image/3699... NaN

Flatten lists for CSV export

There are a few fields that have a series of values in a list (like subject). Here we'll join them into pipe (|) delimited strings to make them easier to export as CSVs.

In [12]:
cols_with_lists = ['creator', 'contributor', 'type', 'format', 'abstract', 'subject', 'description']
for col in cols_with_lists:
    df[col] = df[col].str.join('|')
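The .str.join accessor works elementwise on list values, and missing values pass through as NaN rather than raising an error. A minimal illustration with made-up values:

```python
import numpy as np
import pandas as pd

# Lists are joined into pipe-delimited strings; NaN stays NaN.
s = pd.Series([["Jane Host", "Joe Guest"], np.nan])
joined = s.str.join("|")
```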
In [ ]:
csv_file = Path('data', f'abcrn-{datetime.date.today().isoformat()}.csv')
df.to_csv(csv_file, index=False)

Create CSV downloads for individual programs

Here are the programs with the most records. Note that some, like RN Breakfast, are split between two isPartOf values: 'ABC Radio National. RN Breakfast' and 'ABC Radio. RN Breakfast'.

In [16]:
df['isPartOf'].value_counts()[:20]
Out[16]:
ABC Radio National. RN Breakfast        59036
ABC Radio. AM                           55538
ABC Radio. The World Today              51253
ABC Radio. PM                           50853
ABC Radio. RN Breakfast                 19877
ABC Radio. RN Drive                     12759
ABC Radio National. RN Drive            11850
ABC Radio National. Life Matters        10657
ABC Radio National. Late Night Live      9904
ABC Radio. AM Archive                    9825
ABC Radio. PM Archive                    8430
ABC Radio. The World Today Archive       7902
ABC Radio National. The Science Show     6182
ABC Radio National. Saturday Extra       5218
ABC Radio                                4612
ABC Radio National. Counterpoint         4049
ABC Radio. Correspondents Report         3924
ABC Radio National. Health Report        3557
ABC Radio National. Sunday Extra         3451
ABC Radio National. AWAYE!               3311
Name: isPartOf, dtype: int64

Let's save the programs with the most records as separate CSV files to make them a bit easier to work with. We'll also group together programs with multiple isPartOf values.

In [ ]:
programs = {
    'breakfast': ['ABC Radio National. RN Breakfast', 'ABC Radio. RN Breakfast'],
    'am': ['ABC Radio. AM', 'ABC Radio. AM Archive'],
    'pm': ['ABC Radio. PM', 'ABC Radio. PM Archive'],
    'world_today': ['ABC Radio. The World Today', 'ABC Radio. The World Today Archive'],
    'drive': ['ABC Radio. RN Drive', 'ABC Radio National. RN Drive'],
    'latenight': ['ABC Radio National. Late Night Live'],
    'lifematters': ['ABC Radio National. Life Matters'],
    'scienceshow': ['ABC Radio National. The Science Show']
}

for program, labels in programs.items():
    dfp = df.loc[df['isPartOf'].isin(labels)].sort_values(by=['date', 'title'])
    csv_file = Path('data', f'{program}-{datetime.date.today().isoformat()}.csv')
    dfp.to_csv(csv_file, index=False)

Created by Tim Sherratt for the GLAM Workbench
