The XML files contain embedded metadata that includes the name of the prime minister, and the title and date of the transcript. This notebook extracts that metadata from the harvested files and creates a CSV formatted spreadsheet for easy analysis. It also demonstrates some ways of summarising and visualising the metadata.
import os
from bs4 import BeautifulSoup
import arrow
import pandas as pd
import altair as alt
# Set up Altair
#alt.renderers.enable('notebook')
alt.renderers.enable('default')
alt.data_transformers.enable('json')
def get_tag(soup, tag):
    '''
    Given some Soup, find the specified tag and return its value.
    '''
    try:
        value = soup.find(tag).string.strip()
    except AttributeError:
        value = ''
    return value
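To see how `get_tag()` behaves when a tag is empty or missing, here is a small sketch run against a made-up record. The sample XML is invented for illustration and is not the real structure of the harvested files, and the `html.parser` argument is just a choice made for this example.

```python
from bs4 import BeautifulSoup


def get_tag(soup, tag):
    """Return the stripped text of the first matching tag, or '' if it is absent or empty."""
    try:
        return soup.find(tag).string.strip()
    except AttributeError:
        # soup.find() returned None (no such tag), or .string was None (empty tag)
        return ''


# Made-up sample record, for illustration only
sample_xml = """
<transcript>
  <transcript-id>1</transcript-id>
  <title>  An example title  </title>
  <subjects></subjects>
</transcript>
"""

sample_soup = BeautifulSoup(sample_xml, 'html.parser')
found = get_tag(sample_soup, 'title')             # tag present, whitespace stripped
empty = get_tag(sample_soup, 'subjects')          # tag present but empty
missing = get_tag(sample_soup, 'prime-minister')  # tag not in the record
print(repr(found), repr(empty), repr(missing))
```

Both the empty and the missing tag come back as an empty string, which is why blank values show up in the counts later on.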
# Create a list to put the metadata in
all_details = []
# Get the file names of all the harvested files
files = [f for f in os.listdir('transcripts') if f.endswith('.xml')]
# Loop through the harvested files
for filename in files:
    # Open the file
    with open(os.path.join('transcripts', filename), 'rb') as xml_file:
        # Create a dict to put this file's metadata in
        details = {}
        # Load the file contents into Soup and then get the desired tags
        soup = BeautifulSoup(xml_file.read())
        details['id'] = get_tag(soup, 'transcript-id')
        details['title'] = get_tag(soup, 'title')
        details['pm'] = get_tag(soup, 'prime-minister')
        # We're going to reformat the date into the ISO standard, so first get the value
        release_date = get_tag(soup, 'release-date')
        try:
            # Then try to parse the date and reformat as ISO
            iso_date = arrow.get(release_date, 'DD/MM/YYYY').format('YYYY-MM-DD')
        except Exception:
            # If the date is missing or can't be parsed, leave it blank
            iso_date = ''
        details['date'] = iso_date
        details['release_type'] = get_tag(soup, 'release-type')
        details['subjects'] = get_tag(soup, 'subjects')
        details['pdf'] = get_tag(soup, 'document')
        # Add the metadata for this file to the list
        all_details.append(details)
# Convert the metadata to a Pandas dataframe
df = pd.DataFrame(all_details)
df.head()
| | date | id | pdf | pm | release_type | subjects | title |
|---|---|---|---|---|---|---|---|
| 0 | 2014-08-24 | 23765 | | Abbott, Tony | Media Release | | A message from the Prime Minister - Building a... |
| 1 | 1968-03-14 | 1797 | https://pmtranscripts.pmc.gov.au/sites/default... | Gorton, John | Statement in Parliament | | STATEMENT BY THE PRIME MINISTER THE RT.HON. JO... |
| 2 | 2016-11-17 | 40598 | | Turnbull, Malcolm | Transcript | | Press Conference at the launch of the Veterans... |
| 3 | 1978-09-27 | 4837 | https://pmtranscripts.pmc.gov.au/sites/default... | Fraser, Malcolm | Media Release | | GRANT TO WORLD WILDLIFE FUND AUSTRALIA |
| 4 | 2004-03-24 | 21172 | | Howard, John | Interview | | Doorstop Interview Great Hall, Parliament Hous... |
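The date handling in the loop above converts `DD/MM/YYYY` values to ISO format and falls back to an empty string when parsing fails. As a rough sketch of the same idea using only the standard library (swapping `datetime.strptime` in for `arrow`, which may accept slightly different inputs), a hypothetical helper called `to_iso_date()` might look like this:

```python
from datetime import datetime


def to_iso_date(value):
    """Convert a DD/MM/YYYY string to YYYY-MM-DD, or return '' if it can't be parsed."""
    try:
        return datetime.strptime(value, '%d/%m/%Y').strftime('%Y-%m-%d')
    except (ValueError, TypeError):
        # ValueError: the string doesn't match the format; TypeError: not a string at all
        return ''


good = to_iso_date('14/03/1968')
blank = to_iso_date('')
wrong_format = to_iso_date('1968-03-14')
print(repr(good), repr(blank), repr(wrong_format))
```

ISO dates sort correctly as plain strings, and tools like Altair and Pandas can parse them without being told the format.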
Save the index as a CSV formatted file.
# Save the metadata as a CSV file
df.to_csv('index.csv', index=False)
# This loads the metadata from the CSV
# This is not necessary if you've just generated the df
df = pd.read_csv('index.csv', keep_default_na=False)
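The `keep_default_na=False` setting matters because the code in this notebook tests for empty strings (for example `df['pm'] != ''`). Here is a small sketch with made-up data showing what happens to an empty value when the CSV is read back with and without that setting.

```python
import io

import pandas as pd

# Toy data for illustration only: the second row has no Prime Minister
toy = pd.DataFrame({'id': [1, 2], 'pm': ['Gorton, John', '']})
csv_text = toy.to_csv(index=False)

# Default behaviour: empty fields are read back as NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, empty fields stay as empty strings
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['pm'].tolist())
print(kept_read['pm'].tolist())
```

With the default settings, a filter like `df['pm'] != ''` would not exclude the blank rows, because `NaN != ''` evaluates to `True`.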
Let's have a look at the number of transcripts for each Prime Minister.
df['pm'].value_counts()
Howard, John         5865
Hawke, Robert        2321
Fraser, Malcolm      2081
Gillard, Julia       2072
Turnbull, Malcolm    1751
Rudd, Kevin          1735
Keating, Paul        1582
Abbott, Tony         1371
Whitlam, Gough       1238
Menzies, Robert      1212
Gorton, John         625
Holt, Harold         507
McMahon, William     349
                     74
McEwen, John         16
Chifley, Ben         11
Curtin, John         4
Name: pm, dtype: int64
Note that there are 74 transcripts that have no Prime Minister associated with them (the unlabelled row in the output above). We'll exclude those and chart the rest.
# We're excluding those transcripts where the 'pm' value is empty - df.loc[df['pm'] != '']
alt.Chart(df.loc[df['pm'] != '']).mark_bar().encode(
    # Show number of transcripts on X axis
    x=alt.X('count():Q', title='Number of transcripts'),
    # Show the name of the PM on the Y axis
    y=alt.Y('pm:N', title='Prime Minister'),
    # Show details when you hover
    tooltip=[alt.Tooltip('pm:N', title='Prime Minister'), alt.Tooltip('count():Q', title='Transcripts')]
)
Let's have a look at the number of transcripts over time.
alt.Chart(df.loc[df['pm'] != '']).mark_bar(size=6).encode(
    # Year on the X axis
    x=alt.X('year(date):T', title='Year'),
    # Number of transcripts on the Y axis
    y=alt.Y('count():Q', title='Number of transcripts'),
    # Color denotes the PM (names will be in the legend)
    color=alt.Color('pm:N', scale=alt.Scale(scheme='tableau20'), legend=alt.Legend(title='Prime Minister')),
    # Show details on hover
    tooltip=[alt.Tooltip('pm:N', title='Prime Minister'), alt.Tooltip('year(date):T', title='Year'), alt.Tooltip('count():Q', title='Transcripts')]
).properties(width=700)
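If you want the yearly totals as numbers rather than as a chart, one possible approach is to pull the year out of the ISO date with Pandas. This is a sketch on made-up rows, not the real index. The `errors='coerce'` option is used so that blank dates become missing values instead of raising an error.

```python
import pandas as pd

# Toy data for illustration only: the last row has no date
toy = pd.DataFrame({
    'date': ['1968-03-14', '1968-07-01', '1978-09-27', ''],
    'pm': ['Gorton, John', 'Gorton, John', 'Fraser, Malcolm', 'Holt, Harold'],
})

# Parse the ISO dates; anything unparseable (including '') becomes NaT
years = pd.to_datetime(toy['date'], errors='coerce').dt.year

# Drop the rows without a year, then count transcripts per year in date order
per_year = years.dropna().astype(int).value_counts().sort_index()
print(per_year)
```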
The release_type field should tell us what sort of text this transcript represents – interview, media release, etc. Let's see how it's used.
df['release_type'].value_counts()
Media Release    8686
Interview    5352
Speech    4988
Transcript    1746
Press Conference    745
Statement in Parliament    341
Video Transcript    233
Correspondence    131
Broadcast    111
    62
Message    54
Index    47
Statement    44
Doorstop    41
Government    36
Foreign Affairs    16
Economy & Finance    14
Education    14
Arts, Culture & Sport    13
Environment    12
Health    12
Communique    10
Letter    10
Business & Industry    9
Article    9
Defence    8
Website Updates    7
Blog Transcript    6
Remarks    5
Security, Law and Justice    5
Honours, Commemorations & Condolences    5
Press Statement    4
Report    3
Chat Transcript    3
Joint Statement    3
Indigenous Affairs    3
Immigration    2
Foreword    2
Employment    2
Emergency Management    2
Trade    2
Energy Efficiency    1
Programme    1
Regional Australia    1
Budget    1
?    1
Communications    1
Agriculture    1
Digital Economy    1
Broadband    1
Memo    1
Climate Change    1
Workplace Relations    1
School Education    1
Financial Services    1
Social Services    1
Infrastructure & Transport    1
Name: release_type, dtype: int64
Hmmm, so it looks like there's not a lot of control over the values in this field – is a 'Press Statement' the same as a 'Media Release'? It also looks like some subjects have been incorrectly entered here.
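If you decided that some of these labels really do mean the same thing, you could merge them before counting. The sketch below uses made-up values, and the mapping in `merge_map` is purely hypothetical: whether a 'Press Statement' should be treated as a 'Media Release' is a judgement call that this notebook doesn't settle.

```python
import pandas as pd

# Toy values for illustration only
release_types = pd.Series(['Media Release', 'Press Statement', 'Interview', 'Doorstop', ''])

# Hypothetical mapping from variant labels to a preferred label
merge_map = {'Press Statement': 'Media Release'}

# Replace the variants, leaving every other value untouched
normalised = release_types.replace(merge_map)

# Count the merged labels, ignoring blanks
counts = normalised[normalised != ''].value_counts()
print(counts)
```

Keeping the mapping in a separate dictionary makes the merging decisions easy to review and change.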
By combining groupby() and value_counts() it's easy to look at the release types for each PM.
df.groupby('pm')['release_type'].value_counts()
pm                 release_type
                   Interview                39
                   Media Release            14
                   Speech                   13
                                            5
                   Doorstop                 1
                   Press Conference         1
Abbott, Tony       Transcript               620
                   Media Release            450
                   Speech                   245
                                            41
                   Remarks                  5
                   Interview                2
                   Press Conference         2
                   Press Statement          2
                   Statement                2
                   Doorstop                 1
                   Message                  1
Chifley, Ben       Speech                   11
Curtin, John       Speech                   4
Fraser, Malcolm    Media Release            1303
                   Speech                   446
                   Interview                183
                   Statement in Parliament  53
                   Correspondence           51
                   Press Conference         42
                   ?                        1
                   Communique               1
                   Report                   1
Gillard, Julia     Media Release            842
                   Interview                589
...
Menzies, Robert    Article                  8
                   Foreword                 2
                   Correspondence           1
                   Memo                     1
                   Programme                1
Rudd, Kevin        Interview                697
                   Media Release            604
                   Speech                   329
                   Video Transcript         74
                   Doorstop                 9
                   Website Updates          7
                   Blog Transcript          6
                   Press Conference         6
                   Chat Transcript          3
Turnbull, Malcolm  Transcript               1124
                   Media Release            550
                   Speech                   65
                                            12
                   Press Statement          1
Whitlam, Gough     Media Release            790
                   Speech                   245
                   Press Conference         89
                   Broadcast                49
                   Interview                32
                   Statement in Parliament  21
                   Statement                7
                   Communique               2
                   Article                  1
                   Message                  1
                   Report                   1
Name: release_type, Length: 163, dtype: int64
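The grouped counts come back as one long Series. If a table with one row per PM and one column per release type is easier to read, `unstack()` can pivot the result. Here is a sketch using made-up rows rather than the real index.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'pm': ['Gorton, John', 'Gorton, John', 'Gorton, John', 'Holt, Harold'],
    'release_type': ['Speech', 'Speech', 'Interview', 'Speech'],
})

# Count release types within each PM, then pivot the release types into columns.
# Combinations that never occur are filled with 0 instead of NaN.
table = toy.groupby('pm')['release_type'].value_counts().unstack(fill_value=0)
print(table)
```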
Because the release_type field is a bit of a mess, let's get the top ten release types and visualise them for each PM.
# The index returned by value_counts() holds the names of the release types, so we can convert it to a list and slice off the first 10.
top_ten_types = df.loc[df['release_type'] != '']['release_type'].value_counts().index.to_list()[:10]
# We're excluding those transcripts where the 'pm' value is empty, and keeping only the top ten release types
alt.Chart(df.loc[(df['pm'] != '') & (df['release_type'].isin(top_ten_types))]).mark_bar().encode(
    x='count():Q',
    y='pm:N',
    color='release_type:N'
)
df.loc[df['subjects'] != ''].shape
(1122, 7)
len(pd.unique(df['subjects']))
1036
df['subjects'].value_counts()
    21692
Same-sex marriage    14
Budget 2014.    14
East Timor    13
Bali tragedy    7
Budget 2014    5
National Energy Guarantee    5
Bali tragedy.    4
Same sex marriage    3
Malaysia Airlines tragedy.    3
Citizenship    3
Federal Budget    3
Australian Defence Force contribution to international coalition against ISIL, visit to Arnhem Land, constitutional recognition for the first Australians.    3
Malaysia Airlines tragedy    3
Meeting of the Council of Australian Governments.    2
G20 volunteer launch, G20 Summit, Malaysia Airlines flight MH17, World infrastructure hub, ACOSS report, school curriculum, Ebola, China’s coal tariff, Brisbane transport video.    2
Anzac Day commemorations.    2
Qantas.    2
G20    2
Same-sex marriage survey    2
Zimbabwe    2
Malaysia Airlines Flight MH17.    2
Budget 2015.    2
Victorian infrastructure, sentencing, same-sex marriage    2
International supply mission to Iraq    2
Malaysia Airlines Flight MH17    2
The Budget    2
Malaysia Airlines Flight MH17, Operation Sovereign Borders.    2
Budget 2015    2
Joint Counter Terrorism Team Operation.    2
...
Deployment of Australian troops to East Timor    1
Prime Minister’s visit to the Torres Strait, constitutional recognition for the first Australians, income tax.    1
Official opening of the Devondale Murray Goulburn Dairy Beverages Centre, China-Australia Free Trade Agreement negotiations, new Senate, Operation Sovereign Borders, Competition Policy Review, Commonwealth Bank Review.    1
Visit to the USA, Iraq, direct action plan to reduce carbon emissions, Socceroos’ World Cup campaign.    1
Mark Waugh; trade links    1
Energy prices, airport security, terrorist attacks in Indonesia, 2018 Federal Budget, immigration, pre-selections, protests in Gaza and the Royal Wedding    1
Terrorist Attack in Turkey    1
Jobs, economic growth, trade, Labor failure    1
Great Barrier Reef announcement; National Energy Guarantee    1
Same-sex marriage; 2018; Citizenship; Nationals    1
New South Wales bushfires, disaster recovery payments.    1
the Federal Government’s commitment to grow small business, the Australian automotive industry, the Federal Government’s commitment to repeal the carbon tax, intelligence agencies, Operation Sovereign Borders, live cattle exports, parliamentary entitlemen    1
Grand Gateway Exchange opens ahead of schedule, Syrian humanitarian crisis, Canning by-election, Andrew Hastie, domestic violence.    1
Company tax cuts; Electricity Prices; Leadership    1
Royal Flying Doctor Service; Regional health; Water management; Indigenous incarceration rates    1
East Timor, Steve Pratt and Peter Wallace, Branko Jelen, Defence capabilty, China and Taiwan relations, Olympic tickets, Driza-bone, Austudy, memorial for Reverend John Flynn, defence spending, Victoria election, buy Australia campaign, business tax reform, republic referendum, Rafter at the US Open    1
Referendum result; Preamble result.    1
Antarctic icebreaker    1
the Federal Government’s commitment to repeal the carbon tax, Labor’s debt legacy, Operation Sovereign Borders, Indonesia.    1
Knights of the Order of Australia, the Government’s plans for 2015, Death of His Royal Highness King Abdullah bin Abdulaziz al-Saud of the Kingdom of Saudi Arabia.    1
Export deals; Defence Industry; Australian jobs; Banks; Polls    1
Economy, Jobs, Growth, Trade,    1
People smuggling cooperation with Sri Lanka, CHOGM, Productivity Commission Inquiry to focus on more flexible, affordable and accessible child care, GrainCorp.    1
Australian Cyber Security Centre; delivering lower energy prices for families and businesses; wages growth; business tax cuts    1
Great Barrier Reef; water policy; back to school    1
East Asia Summit 2013, Australia-Japan relationship, Trans-Pacific Partnership, minor parties, entitlements.    1
Stronger economy, Maurice Newman, resources, Operation Sovereign Borders, superannuation, housing sector, IPCC’s report, university places, Parliamentary sittings, minor parties.    1
Funding agreement to fast-track construction of WestConnex, building the roads of the 21st century for New South Wales, Budget 2014.    1
School funding; Higher education funding; GST distribution    1
Senator Fraser Anning    1
Name: subjects, Length: 1036, dtype: int64
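The counts above show the same subject entered in slightly different ways ('Budget 2014.' and 'Budget 2014'), and many values that pack several topics into one string separated by commas or semicolons. One possible way to get more useful counts is to split each value into individual terms and tidy them up before counting. This is only a sketch on a handful of made-up rows: `split_subjects()` is a hypothetical helper, and splitting on every comma will wrongly break up any subject that contains a comma of its own.

```python
import re

import pandas as pd

# Toy values for illustration only, modelled on the patterns in the output above
subjects = pd.Series([
    'Budget 2014.',
    'Budget 2014',
    'Same-sex marriage; Citizenship',
    'Same sex marriage',
    '',
])


def split_subjects(value):
    """Split a subjects string on commas/semicolons and tidy each term."""
    terms = []
    for part in re.split(r'[;,]', value):
        # Trim whitespace, drop a trailing full stop, and lowercase
        term = part.strip().rstrip('.').strip().lower()
        if term:
            terms.append(term)
    return terms


# One row per term; blank subjects produce empty lists, which explode() turns into NaN
term_counts = subjects.apply(split_subjects).explode().dropna().value_counts()
print(term_counts)
```

Note that 'same-sex marriage' and 'same sex marriage' still end up as separate terms, so this kind of tidying only goes part of the way.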