This notebook shows a few ways you can start exploring the ABC Radio National metadata harvested by the accompanying harvesting notebook.
For an earlier experiment playing with this data, see In a word...: Currents in Australian affairs, 2003–2013.
import pandas as pd
from pathlib import Path
import re
import altair as alt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
# Download the NLTK stopwords list (only needs to run once)
nltk.download('stopwords')
Load the harvested data.
df = pd.read_csv(Path('data', 'abcrn-2021-01-15.csv'))
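It's worth having a quick look at the first few rows before we start counting things. (The exact output will depend on your harvested file.)
df.head()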
How many records are there?
df.shape[0]
What fields does the dataset include?
df.columns
How many programs are there records for?
programs = list(df['isPartOf'].unique())
len(programs)
sorted(programs)
Which programs have the most records?
df['isPartOf'].value_counts()[:25]
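We could also chart these counts, using the same Altair pattern as the year chart below. This is just a sketch; program_counts is a name introduced here.
# Count records per program and chart the top 25
program_counts = df['isPartOf'].value_counts()[:25].to_frame().reset_index()
program_counts.columns = ['program', 'count']
alt.Chart(program_counts).mark_bar().encode(
    x='count:Q',
    y=alt.Y('program:N', sort='-x'),
    tooltip=['program', alt.Tooltip('count:Q', format=',')]
)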
To look at the number of records by year, we need to make sure the date field is being recognised as a datetime. Then we can extract the year into a new column.
# Convert the date field to datetime values, turning unparseable dates into NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Extract the year, using a nullable integer type to allow for missing dates
df['year'] = df['date'].dt.year.astype('Int64')
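Because we used errors='coerce', any dates that couldn't be parsed are now NaT (and their year is <NA>). A quick check on how many records are affected:
# Count records with a missing or unparseable date
df['date'].isna().sum()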
Find the number of times each year appears.
year_counts = df['year'].value_counts().to_frame().reset_index()
year_counts.columns = ['year', 'count']
Chart the results.
alt.Chart(year_counts).mark_bar().encode(
    x='year:O',
    y='count:Q',
    tooltip=['year', alt.Tooltip('count:Q', format=',')]
)
The early records look a bit suspect, and I should probably check them manually. I'm also wondering why there's been such a large decline in the number of records added since 2017.
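One way to start checking is to pull up the earliest records for manual inspection. A minimal sketch; adjust the columns to whatever you want to eyeball.
# Display the ten earliest records
df.sort_values('date').head(10)[['date', 'isPartOf', 'title']]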
The contributor field includes the names of hosts, reporters, and guests. It's stored as a pipe-delimited string, so we have to split the string, then explode the resulting list to create one row per name.
people = df['contributor'].str.split('|').explode().dropna()
Then we can calculate how often people appear in the records.
people.value_counts()[:25]
We can also visualise the most frequent contributors as a word cloud.
wc_people = WordCloud(width=1000, height=500).fit_words(people.value_counts().to_dict())
wc_people.to_image()
There are three text fields that could yield some interesting analysis. The title field is obvious enough, though some regular segments do have duplicate titles. The abstract field is a brief summary of the segment or program. The description field seems to be the beginning of the transcript, but is often much the same as the abstract.
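To get a sense of how similar the abstract and description fields really are, we can sample a few records that have all three text fields. (Just a quick look, assuming the field names above.)
# Compare the three text fields in a few random records
df[['title', 'abstract', 'description']].dropna().sample(3)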
Let's try aggregating the titles for a program.
breakfast_titles = list(
    df.loc[(df['isPartOf'].isin(['ABC Radio National. RN Breakfast', 'ABC Radio. RN Breakfast'])) & (df['year'] == 2020)]
    .drop_duplicates(subset=['title'], keep=False)['title']
    .unique()
)
wordcloud = WordCloud(
    width=1000,
    height=500,
    stopwords=stopwords.words('english') + ['Australia', 'Australian', 'Australians', 'New', 'News', 'Matt', 'Bevan', 'World'],
    collocations=False,
).generate(' '.join(breakfast_titles))
wordcloud.to_image()
Now let's do the same for RN Drive.
drive_titles = list(
    df.loc[(df['isPartOf'].isin(['ABC Radio National. RN Drive', 'ABC Radio. RN Drive'])) & (df['year'] == 2020)]
    .drop_duplicates(subset=['title'], keep=False)['title']
    .unique()
)
wordcloud = WordCloud(
    width=1000,
    height=500,
    stopwords=stopwords.words('english') + ['Australia', 'Australian', 'Australians', 'New', 'News', 'Matt', 'Bevan', 'World'],
    collocations=False,
).generate(' '.join(drive_titles))
wordcloud.to_image()
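Since the Breakfast and Drive word clouds repeat the same steps, you could wrap them in a small helper function. This is only a sketch: titles_wordcloud, program_names, year, and extra_stopwords are names introduced here, and it assumes the df created above.
def titles_wordcloud(program_names, year, extra_stopwords=None):
    # Aggregate the unique titles for the given program(s) and year
    titles = (
        df.loc[(df['isPartOf'].isin(program_names)) & (df['year'] == year)]
        .drop_duplicates(subset=['title'], keep=False)['title']
        .unique()
    )
    # Build the word cloud, filtering standard and user-supplied stopwords
    wc = WordCloud(
        width=1000,
        height=500,
        stopwords=stopwords.words('english') + (extra_stopwords or []),
        collocations=False,
    ).generate(' '.join(titles))
    return wc.to_image()

titles_wordcloud(['ABC Radio National. RN Breakfast', 'ABC Radio. RN Breakfast'], 2020, ['Australia', 'Australian', 'Australians'])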
Let's see how mentions of bushfires in segment titles have changed over time.
# Drop records without a title
df_titles = df.dropna(subset=['title'])
# Find titles containing 'bushfire' (case-insensitive)
bushfires = df_titles.loc[df_titles['title'].str.contains(r'bushfire', regex=True, flags=re.IGNORECASE)]
# Chart the results
alt.Chart(bushfires).mark_line().encode(
    x='year(date):T',
    y='count()',
    tooltip=['year(date):T', 'count():Q']
)
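Nothing here is specific to bushfires, so you could wrap the search and chart in a reusable function and try other terms. Another sketch: plot_title_mentions is a name introduced here, and 'drought' is just an example search term.
def plot_title_mentions(term):
    # Find titles containing the search term (case-insensitive)
    matches = df_titles.loc[df_titles['title'].str.contains(term, regex=True, flags=re.IGNORECASE)]
    # Chart the number of matching records per year
    return alt.Chart(matches).mark_line().encode(
        x='year(date):T',
        y='count()',
        tooltip=['year(date):T', 'count():Q']
    )

plot_title_mentions(r'drought')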