Create an index to the harvested files

The XML files contain embedded metadata that includes the name of the prime minister, and the title and date of the transcript. This notebook extracts that metadata from the harvested files and creates a CSV formatted spreadsheet for easy analysis. It also demonstrates some ways of summarising and visualising the metadata.
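
As a rough illustration of what "embedded metadata" means here, the sketch below pulls a few tags out of a made-up XML fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup so that it runs without extra packages. The sample document, its root element name, and the helper get_tag_text() are assumptions for illustration; only the tag names come from the code in this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real harvested files may be structured differently.
SAMPLE_XML = """<transcript>
  <transcript-id>1797</transcript-id>
  <title> STATEMENT BY THE PRIME MINISTER </title>
  <prime-minister>Gorton, John</prime-minister>
  <release-date>14/03/1968</release-date>
</transcript>"""


def get_tag_text(root, tag):
    """Return the stripped text of the first matching tag, or '' if absent or empty."""
    element = root.find(f'.//{tag}')
    if element is None or element.text is None:
        return ''
    return element.text.strip()


root = ET.fromstring(SAMPLE_XML)
# 'subjects' is deliberately missing from the sample to show the fallback
tags = ('transcript-id', 'title', 'prime-minister', 'subjects')
details = {tag: get_tag_text(root, tag) for tag in tags}
print(details)
```

Returning an empty string for a missing tag keeps every record the same shape, which makes it simple to load the list of records into a dataframe later.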

In [113]:
import os
from bs4 import BeautifulSoup
import arrow
import pandas as pd
import altair as alt

# Set up Altair

Extract the metadata and save as a CSV

In [114]:
def get_tag(soup, tag):
    '''
    Given some Soup, find the specified tag and return its value.
    Returns an empty string if the tag is missing or has no text.
    '''
    try:
        value = soup.find(tag).string.strip()
    except AttributeError:
        value = ''
    return value

# Create a list to put the metadata in
all_details = []

# Get the file names of all the harvested files
files = [f for f in os.listdir('transcripts') if f[-4:] == '.xml']

# Loop through the harvested files
for filename in files:
    # Open the file
    with open(os.path.join('transcripts', filename), 'rb') as xml_file:
        # Create a dict to put this file's metadata in
        details = {}
        # Load the file contents into Soup and then get the desired tags
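        # The arguments were cut off in the source; passing the open file and the 'xml' parser (needs lxml) is an assumption
        soup = BeautifulSoup(xml_file, 'xml')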
        details['id'] = get_tag(soup, 'transcript-id')
        details['title'] = get_tag(soup, 'title')
        details['pm'] = get_tag(soup, 'prime-minister')
        # We're going to reformat the date into the ISO standard, so first get the value
        release_date = get_tag(soup, 'release-date')
        try:
            # Then try to parse the date and reformat as ISO
            iso_date = arrow.get(release_date, 'DD/MM/YYYY').format('YYYY-MM-DD')
        except Exception:
            # If something goes wrong (e.g. a missing or malformed date), leave it blank
            iso_date = ''
        details['date'] = iso_date
        details['release_type'] = get_tag(soup, 'release-type')
        details['subjects'] = get_tag(soup, 'subjects')
        details['pdf'] = get_tag(soup, 'document')
        # Add the metadata for this file to the list
        all_details.append(details)
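
The date handling in the loop above can be tried in isolation. This is a minimal sketch that swaps arrow for the standard library's datetime module; the helper name to_iso_date() is hypothetical, and the DD/MM/YYYY input format is taken from the format string used in the cell above.

```python
from datetime import datetime


def to_iso_date(value, fmt='%d/%m/%Y'):
    """Convert a DD/MM/YYYY string to YYYY-MM-DD, or return '' if it can't be parsed."""
    try:
        return datetime.strptime(value.strip(), fmt).date().isoformat()
    except (ValueError, AttributeError):
        # ValueError: the text doesn't match the format or isn't a real date
        # AttributeError: the value isn't a string at all (e.g. None)
        return ''


print(to_iso_date('14/03/1968'))
print(to_iso_date(''))
print(to_iso_date('31/02/2004'))
```

ISO dates sort correctly as plain text and are recognised by most tools, which is why it's worth normalising them before saving the index.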
In [115]:
# Convert the metadata to a Pandas dataframe
df = pd.DataFrame(all_details)
# Show the first few rows
df.head()
   date        id     pdf  pm                 release_type             subjects  title
0  2014-08-24  23765       Abbott, Tony       Media Release                      A message from the Prime Minister - Building a...
1  1968-03-14  1797        Gorton, John       Statement in Parliament            STATEMENT BY THE PRIME MINISTER THE RT.HON. JO...
2  2016-11-17  40598       Turnbull, Malcolm  Transcript                         Press Conference at the launch of the Veterans...
3  1978-09-27  4837        Fraser, Malcolm    Media Release                      GRANT TO WORLD WILDLIFE FUND AUSTRALIA
4  2004-03-24  21172       Howard, John       Interview                          Doorstop Interview Great Hall, Parliament Hous...

Save the index as a CSV formatted file.

In [116]:
# Save the metadata as a CSV file
df.to_csv('index.csv', index=False)

Analyse the metadata

In [117]:
# This loads the metadata from the CSV
# This is not necessary if you've just generated the df
df = pd.read_csv('index.csv', keep_default_na=False)
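
The keep_default_na=False argument matters for the filters used later in this notebook. The sketch below, using a made-up two-row CSV, shows the difference: with the defaults an empty cell is read back as NaN, which is not equal to '', so a test like df['pm'] != '' would no longer exclude it.

```python
import io

import pandas as pd

# Hypothetical CSV with one blank 'pm' value
csv_text = 'id,pm\n1,"Howard, John"\n2,\n'

default_df = pd.read_csv(io.StringIO(csv_text))
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# With the defaults the empty 'pm' cell becomes NaN, and NaN != '' is True,
# so the blank row slips through a filter meant to drop it
print(len(default_df.loc[default_df['pm'] != '']))

# With keep_default_na=False the cell stays as '', so the filter removes it
print(len(kept_df.loc[kept_df['pm'] != '']))
```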

Let's have a look at the number of transcripts for each Prime Minister.

In [118]:
df['pm'].value_counts()
Howard, John         5865
Hawke, Robert        2321
Fraser, Malcolm      2081
Gillard, Julia       2072
Turnbull, Malcolm    1751
Rudd, Kevin          1735
Keating, Paul        1582
Abbott, Tony         1371
Whitlam, Gough       1238
Menzies, Robert      1212
Gorton, John          625
Holt, Harold          507
McMahon, William      349
McEwen, John           16
Chifley, Ben           11
Curtin, John            4
Name: pm, dtype: int64

Note that there are 73 transcripts that have no Prime Minister associated with them. We'll exclude those and chart the rest.

In [112]:
# We're excluding those transcripts where the 'pm' value is empty - df.loc[df['pm'] != '']
alt.Chart(df.loc[df['pm'] != '']).mark_bar().encode(
    # Show number of transcripts on X axis
    x=alt.X('count():Q', title='Number of transcripts'),
    # Show the name of the PM on the Y axis
    y=alt.Y('pm:N', title='Prime Minister'),
    # Show details when you hover
    tooltip=[alt.Tooltip('pm:N', title='Prime Minister'), alt.Tooltip('count():Q', title='Transcripts')]
)

Let's have a look at the number of transcripts over time.

In [119]:
alt.Chart(df.loc[df['pm'] != '']).mark_bar(size=6).encode(
    # Year on the X axis
    x=alt.X('year(date):T', title='Year'),
    # Number of transcripts on the Y axis
    y=alt.Y('count():Q', title='Number of transcripts'),
    # Color denotes the PM (names will be in the legend)
    color=alt.Color('pm:N', scale=alt.Scale(scheme='tableau20'), legend=alt.Legend(title='Prime Minister')),
    # Show details on hover
    tooltip=[alt.Tooltip('pm:N', title='Prime Minister'), alt.Tooltip('year(date):T', title='Year'), alt.Tooltip('count():Q', title='Transcripts')]
)

Release types and subjects

The release_type field should tell us what sort of text this transcript represents – interview, media release, etc. Let's see how it's used.

In [96]:
df['release_type'].value_counts()
Media Release                            8686
Interview                                5352
Speech                                   4988
Transcript                               1746
Press Conference                          745
Statement in Parliament                   341
Video Transcript                          233
Correspondence                            131
Broadcast                                 111
Message                                    54
Index                                      47
Statement                                  44
Doorstop                                   41
Government                                 36
Foreign Affairs                            16
Economy & Finance                          14
Education                                  14
Arts, Culture & Sport                      13
Environment                                12
Health                                     12
Communique                                 10
Letter                                     10
Business & Industry                         9
Article                                     9
Defence                                     8
Website Updates                             7
Blog Transcript                             6
Remarks                                     5
Security, Law and Justice                   5
Honours, Commemorations & Condolences       5
Press Statement                             4
Report                                      3
Chat Transcript                             3
Joint Statement                             3
Indigenous Affairs                          3
Immigration                                 2
Foreword                                    2
Employment                                  2
Emergency Management                        2
Trade                                       2
Energy Efficiency                           1
Programme                                   1
Regional Australia                          1
Budget                                      1
?                                           1
Communications                              1
Agriculture                                 1
Digital Economy                             1
Broadband                                   1
Memo                                        1
Climate Change                              1
Workplace Relations                         1
School Education                            1
Financial Services                          1
Social Services                             1
Infrastructure & Transport                  1
Name: release_type, dtype: int64

Hmmm, so it looks like there's not a lot of control over the values in this field – is a 'Press Statement' the same as a 'Media Release'? It also looks like some subjects have been incorrectly entered here.

By combining groupby() and value_counts() it's easy to look at the release types for each PM.

In [76]:
df.groupby('pm')['release_type'].value_counts()
pm                 release_type           
                   Interview                    39
                   Media Release                14
                   Speech                       13
                   Doorstop                      1
                   Press Conference              1
Abbott, Tony       Transcript                  620
                   Media Release               450
                   Speech                      245
                   Remarks                       5
                   Interview                     2
                   Press Conference              2
                   Press Statement               2
                   Statement                     2
                   Doorstop                      1
                   Message                       1
Chifley, Ben       Speech                       11
Curtin, John       Speech                        4
Fraser, Malcolm    Media Release              1303
                   Speech                      446
                   Interview                   183
                   Statement in Parliament      53
                   Correspondence               51
                   Press Conference             42
                   ?                             1
                   Communique                    1
                   Report                        1
Gillard, Julia     Media Release               842
                   Interview                   589
                                              ...
Menzies, Robert    Article                       8
                   Foreword                      2
                   Correspondence                1
                   Memo                          1
                   Programme                     1
Rudd, Kevin        Interview                   697
                   Media Release               604
                   Speech                      329
                   Video Transcript             74
                   Doorstop                      9
                   Website Updates               7
                   Blog Transcript               6
                   Press Conference              6
                   Chat Transcript               3
Turnbull, Malcolm  Transcript                 1124
                   Media Release               550
                   Speech                       65
                   Press Statement               1
Whitlam, Gough     Media Release               790
                   Speech                      245
                   Press Conference             89
                   Broadcast                    49
                   Interview                    32
                   Statement in Parliament      21
                   Statement                     7
                   Communique                    2
                   Article                       1
                   Message                       1
                   Report                        1
Name: release_type, Length: 163, dtype: int64
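
To see how the groupby() and value_counts() combination behaves without the full dataset, here is a small sketch on a toy dataframe. The rows and the placeholder names 'PM A' and 'PM B' are invented for illustration.

```python
import pandas as pd

# Toy data: five made-up transcripts for two placeholder PMs
toy = pd.DataFrame({
    'pm': ['PM A', 'PM A', 'PM B', 'PM B', 'PM B'],
    'release_type': ['Speech', 'Speech', 'Speech', 'Media Release', 'Media Release'],
})

# Count release types separately within each PM's group
counts = toy.groupby('pm')['release_type'].value_counts()
print(counts)

# The result has a two-level index (pm, release_type),
# so one PM's breakdown can be selected by name
print(counts.loc['PM B'])
```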

Because the release_type field is a bit of a mess, let's get the top ten release types and visualise them for each PM.

In [94]:
# The index returned by value_counts() are the names of the release types, so we can convert to a list and slice off the first 10.
top_ten_types = df.loc[df['release_type'] != '']['release_type'].value_counts().index.to_list()[:10]
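
The same slice-then-filter pattern can be checked on a toy column. This sketch uses invented values and takes the top two rather than the top ten; the variable names are hypothetical.

```python
import pandas as pd

# Toy data, including one blank value that should be ignored
toy = pd.DataFrame({
    'release_type': ['Speech', 'Speech', 'Speech', 'Interview', 'Interview', 'Memo', '']
})

# value_counts() sorts from most to least common, so slicing the index gives the top values
non_empty = toy.loc[toy['release_type'] != '']
top_two = non_empty['release_type'].value_counts().index.to_list()[:2]
print(top_two)

# isin() keeps only the rows whose value appears in the list
filtered = toy.loc[toy['release_type'].isin(top_two)]
print(filtered.shape)
```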
In [95]:
# We're excluding those transcripts where the 'pm' value is empty - df.loc[df['pm'] != '']
alt.Chart(df.loc[(df['pm'] != '') & (df['release_type'].isin(top_ten_types))]).mark_bar().encode(
In [99]:
df.loc[df['subjects'] != ''].shape
(1122, 7)
In [98]:
df.loc[df['subjects'] != '']['subjects'].value_counts()
Same-sex marriage                                                                                                                                                                                                                                                                                                  14
Budget 2014.                                                                                                                                                                                                                                                                                                       14
East Timor                                                                                                                                                                                                                                                                                                         13
Bali tragedy                                                                                                                                                                                                                                                                                                        7
Budget 2014                                                                                                                                                                                                                                                                                                         5
National Energy Guarantee                                                                                                                                                                                                                                                                                           5
Bali tragedy.                                                                                                                                                                                                                                                                                                       4
Same sex marriage                                                                                                                                                                                                                                                                                                   3
Malaysia Airlines tragedy.                                                                                                                                                                                                                                                                                          3
Citizenship                                                                                                                                                                                                                                                                                                         3
Federal Budget                                                                                                                                                                                                                                                                                                      3
Australian Defence Force contribution to international coalition against ISIL,  visit to Arnhem Land,  constitutional recognition for the first Australians.                                                                                                                                                        3
Malaysia Airlines tragedy                                                                                                                                                                                                                                                                                           3
Meeting of the Council of Australian Governments.                                                                                                                                                                                                                                                                   2
G20 volunteer launch,  G20 Summit,  Malaysia Airlines flight MH17,  World infrastructure hub,  ACOSS report,  school curriculum,  Ebola,  China’s coal tariff,  Brisbane transport video.                                                                                                                         2
Anzac Day commemorations.                                                                                                                                                                                                                                                                                           2
Qantas.                                                                                                                                                                                                                                                                                                             2
G20                                                                                                                                                                                                                                                                                                                 2
Same-sex marriage survey                                                                                                                                                                                                                                                                                            2
Zimbabwe                                                                                                                                                                                                                                                                                                            2
Malaysia Airlines Flight MH17.                                                                                                                                                                                                                                                                                      2
Budget 2015.                                                                                                                                                                                                                                                                                                        2
Victorian infrastructure, sentencing, same-sex marriage                                                                                                                                                                                                                                                             2
International supply mission to Iraq                                                                                                                                                                                                                                                                                2
Malaysia Airlines Flight MH17                                                                                                                                                                                                                                                                                       2
The Budget                                                                                                                                                                                                                                                                                                          2
Malaysia Airlines Flight MH17,  Operation Sovereign Borders.                                                                                                                                                                                                                                                        2
Budget 2015                                                                                                                                                                                                                                                                                                         2
Joint Counter Terrorism Team Operation.                                                                                                                                                                                                                                                                             2
Deployment of Australian troops to East Timor                                                                                                                                                                                                                                                                       1
Prime Minister’s visit to the Torres Strait,  constitutional recognition for the first Australians,  income tax.                                                                                                                                                                                                  1
Official opening of the Devondale Murray Goulburn Dairy Beverages Centre,  China-Australia Free Trade Agreement negotiations,  new Senate,  Operation Sovereign Borders,  Competition Policy Review, Commonwealth Bank Review.                                                                                      1
Visit to the USA,  Iraq,  direct action plan to reduce carbon emissions,  Socceroos’ World Cup campaign.                                                                                                                                                                                                          1
Mark Waugh; trade links                                                                                                                                                                                                                                                                                             1
Energy prices, airport security, terrorist attacks in Indonesia, 2018 Federal Budget, immigration, pre-selections, protests in Gaza and the Royal Wedding                                                                                                                                                           1
Terrorist Attack in Turkey                                                                                                                                                                                                                                                                                          1
Jobs, economic growth, trade, Labor failure                                                                                                                                                                                                                                                                         1
Great Barrier Reef announcement; National Energy Guarantee                                                                                                                                                                                                                                                          1
Same-sex marriage; 2018; Citizenship; Nationals                                                                                                                                                                                                                                                                     1
New South Wales bushfires,  disaster recovery payments.                                                                                                                                                                                                                                                             1
the Federal Government’s commitment to grow small business,  the Australian automotive industry,  the Federal Government’s commitment to repeal the carbon tax,  intelligence agencies,  Operation Sovereign Borders,  live cattle exports,  parliamentary entitlemen                                           1
Grand Gateway Exchange opens ahead of schedule,  Syrian humanitarian crisis,  Canning by-election,  Andrew Hastie,  domestic violence.                                                                                                                                                                              1
Company tax cuts; Electricity Prices; Leadership                                                                                                                                                                                                                                                                    1
Royal Flying Doctor Service; Regional health; Water management; Indigenous incarceration rates                                                                                                                                                                                                                      1
East Timor, Steve Pratt and Peter Wallace, Branko Jelen, Defence capabilty, China and Taiwan relations, Olympic tickets, Driza-bone, Austudy, memorial for Reverend John Flynn, defence spending, Victoria election, buy Australia campaign, business tax reform, republic referendum, Rafter at the US Open        1
Referendum result; Preamble result.                                                                                                                                                                                                                                                                                 1
Antarctic icebreaker                                                                                                                                                                                                                                                                                                1
the Federal Government’s commitment to repeal the carbon tax,  Labor’s debt legacy,  Operation Sovereign Borders,  Indonesia.                                                                                                                                                                                   1
Knights of the Order of Australia,  the Government’s plans for 2015,  Death of His Royal Highness King Abdullah bin Abdulaziz al-Saud of the Kingdom of Saudi Arabia.                                                                                                                                             1
Export deals; Defence Industry; Australian jobs; Banks; Polls                                                                                                                                                                                                                                                       1
Economy, Jobs, Growth, Trade,                                                                                                                                                                                                                                                                                       1
People smuggling cooperation with Sri Lanka,  CHOGM,  Productivity Commission Inquiry to focus on more flexible, affordable and accessible child care,  GrainCorp.                                                                                                                                                  1
Australian Cyber Security Centre; delivering lower energy prices for families and businesses; wages growth; business tax cuts                                                                                                                                                                                       1
Great Barrier Reef; water policy; back to school                                                                                                                                                                                                                                                                    1
East Asia Summit 2013,  Australia-Japan relationship,  Trans-Pacific Partnership,  minor parties,  entitlements.                                                                                                                                                                                                    1
Stronger economy,  Maurice Newman,  resources,  Operation Sovereign Borders,  superannuation,  housing sector,  IPCC’s report,  university places,  Parliamentary sittings,  minor parties.                                                                                                                       1
Funding agreement to fast-track construction of WestConnex,  building the roads of the 21st century for New South Wales,  Budget 2014.                                                                                                                                                                              1
School funding; Higher education funding; GST distribution                                                                                                                                                                                                                                                          1
Senator Fraser Anning                                                                                                                                                                                                                                                                                               1
Name: subjects, Length: 1036, dtype: int64