It's now 2020. Last year we didn't do a countdown. This year, it's not A to Z, but the top 2020 songs of all time. It just started this morning, but I bet we can fit that into the same format.
Every one of these countdowns is just a bit different, and they are a bit of a surprise as they evolve. So the data collection and clean up is usually evolving under time pressure. But as we're up to year three, patterns emerge and I've managed to clean some of it up.
What follows is the latest state of the art, with a bit of clutter from the accumulated countdown data.
%matplotlib inline
from IPython.display import display, HTML
import requests
from lxml import html
import pandas as pd
import numpy as np
from datetime import date, datetime, time
from os import path, mkdir
import re
Whenever possible, we'll cache the data. Partially for speed when rerunning the notebooks during the countdown, but also to make the notebooks reproducible later, if the data ends up moving.
# Cache / data directory layout.  One subdirectory per data source so the
# raw pulls can be inspected or pruned independently.
cache_dir = './cache'
playlist_cache_dir = path.join(cache_dir, 'playlists')
a2z_cache_dir = path.join(cache_dir, 'a2z')
a2z70s_cache_dir = path.join(cache_dir, 'a2z70s')
a2z80s_cache_dir = path.join(cache_dir, 'a2z80s')
# NOTE(review): the name looks like a typo for xpn2020_cache_dir, but it is
# kept as-is in case later parts of the file reference it.
xpn8080_cache_idir = path.join(cache_dir, 'xpn2020')
musicbrainz_cache_dir = path.join(cache_dir, 'musicbrainz')
data_dir = './data'
# Create any directory that doesn't exist yet.
# Fix: the xpn2020 cache directory was missing from this tuple, so it was
# never created even though it was defined above.
for d in (cache_dir, playlist_cache_dir, a2z_cache_dir, a2z70s_cache_dir,
          a2z80s_cache_dir, xpn8080_cache_idir, data_dir, musicbrainz_cache_dir):
    if not path.exists(d): mkdir(d)
Originally I tended to rely on the one-off countdown pages for playlists. But eventually I ended up using the generic playlist at http://xpn.org/playlists/xpn-playlist. I've done this enough, it's past time to turn it into something reusable.
def fetch_daily_playlist(day, cache_dir=None, verbose=False):
    """
    Fetches the XPN playlist for a given date.

    Results are cached one CSV per day, both for speed when re-running
    during a countdown and for reproducibility if the source page moves.

    Args:
        day (datetime.date) : The day to fetch the playlist for
        cache_dir (string) : Path to the cache directory, or None to avoid caching
        verbose (bool) : Print progress information while fetching
    Returns:
        DataFrame containing Artist and Title as Strings and Airtime as Timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Air Time'])
    # Fix: cache_file must be defined (as None) even when caching is
    # disabled; previously the uncached path raised NameError below.
    cache_file = None
    if cache_dir is not None:
        cache_file = path.join(cache_dir, "%04d-%02d-%02d.csv" %
                               (day.year, day.month, day.day))
    if cache_file is not None and path.exists(cache_file):
        songs = pd.read_csv(cache_file, encoding='utf-8')
        # CSV round-trips Air Time as text; coerce back to Timestamp
        songs['Air Time'] = pd.to_datetime(songs['Air Time'], errors='coerce')
        if verbose: print("Got %d rows from %s" % (len(songs), cache_file))
    else:
        day_s = '%02d-%02d-%04d' % (day.month, day.day, day.year)
        page = requests.post('https://xpn.org/playlists/xpn-playlist',
                             data={'playlistdate': day_s})
        if verbose: print("fetching %s returned status %s" % (day_s, page.status_code))
        # play list pages claim to be utf-8, but the rare non-ascii character
        # is always latin-1
        #tree = html.fromstring(page.content.decode('latin-1'))
        tree = html.fromstring(page.content)
        tracks = tree.xpath('//h3/a/text()')
        # not all rows are tracks, some are membership callouts
        # but real tracks start with times and are formatted
        # HH:MM [am|pm] Artist - Title
        # Note that I've seen titles with embedded dashes,
        # but so far no artist names with them. This may be luck.
        # Special programs like World Cafe, Echos, ...
        # also start with an air time, but don't have useful track info
        # but those list the program inside bars
        # eg |World Cafe| - "Wednesday 11-2-2016 Hour 2, Part 7"
        date_regex = re.compile(r"^\d{2}:\d{2}\s")
        line_count = 0
        track_count = 0
        rows = []
        for track in tracks:
            line_count += 1
            if date_regex.match(track) and track[9:10] != '|':
                # split only on the first ' - ' to keep dashes in titles
                (artist, title) = track[9:].split(' - ', 1)
                dt = datetime.strptime(track[:8], '%I:%M %p')
                air_time = datetime.combine(day, dt.time())
                if verbose: print("adding %s %s %s" % (artist, title, air_time))
                rows.append({'Artist': artist,
                             'Title': title,
                             'Air Time': air_time})
                track_count += 1
        # build the frame in one shot; row-at-a-time DataFrame.append is
        # quadratic and was removed in pandas 2.x
        if rows:
            songs = pd.concat([songs, pd.DataFrame(rows)], ignore_index=True)
        if verbose: print('read %d line and added %d tracks' % (line_count, track_count))
        # Drop any duplicates, which are not uncommon
        songs = songs.drop_duplicates()
        if cache_file is not None:
            songs.to_csv(cache_file, index=False, encoding='utf-8')
            if verbose: print('write %d rows to %s' % (len(songs), cache_file))
    return songs
def fetch_playlist(start, end, cache_dir=None):
    """
    Fetch all the playlist entries for a range of time.
    Args:
        start (datetime.datetime) : The inclusive start time to fetch entries for
        end (datetime.datetime) : The exclusive end time to fetch entries for
        cache_dir (string) : path to the cache directory, or None to avoid caching
    Returns:
        Dataframe containing Artist and Title as strings, and Airtime as timestamp
    """
    songs = pd.DataFrame(None, columns=['Artist', 'Title', 'Air Time'])
    # pull whole days, then trim to the requested window below
    daily = [fetch_daily_playlist(day, cache_dir)
             for day in pd.date_range(start.date(), end.date())]
    songs = pd.concat([songs] + daily, ignore_index=True)
    songs = songs[songs['Air Time'] >= start]
    songs = songs[songs['Air Time'] < end]
    # sometimes the playlist entries are duplicated
    # Fix: the result was previously assigned to a misspelled variable
    # ('song'), so the duplicates were never actually dropped.
    songs = songs.drop_duplicates()
    songs = songs.sort_values(by='Air Time')
    return songs
Since this is the third time, and I've pulled the data prep into one notebook, to save redundancy, there are a few play lists to load.
For 2020, XPN is doing a listener curated "top 2020 songs" countdown. It just started this morning, Thursday December 10 at 8:00 am.
Two twists this year. First, like 2018, there is a pause. The countdown stopped at number 101, just after midnight on December 17, and picked back up for the last 100 at 8am. Also, from 6am to 8am, there is a mini-list of One Vote Wonders, songs that only got one vote in the XPN 2020 polling.
xpn2020 = fetch_playlist(datetime(2020, 12, 10, 8, 0), datetime(2020, 12, 17, 0, 41),
playlist_cache_dir)
print "got %d rows" % len(xpn2020)
got 1920 rows
xpn2020_onsies = fetch_playlist(datetime(2020, 12, 17, 6, 0), datetime(2020, 12, 17, 8, 0),
playlist_cache_dir)
print "got %d rows" % len(xpn2020_onsies)
got 24 rows
xpn2020pt2 = fetch_playlist(datetime(2020, 12, 17, 8, 0), datetime(2020, 12, 17, 18, 46),
playlist_cache_dir)
print "got %d rows" % len(xpn2020pt2)
got 100 rows
Before going further, let's take a quick look at the data.
HTML(xpn2020.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
194 | Booker T. & The MG's | Time Is Tight | 2020-12-10 08:02:00 |
193 | AC/DC | T.N.T. | 2020-12-10 08:05:00 |
192 | Peter Frampton | Show Me the Way | 2020-12-10 08:11:00 |
191 | The Drifters | Under The Boardwalk | 2020-12-10 08:16:00 |
190 | Adele | Rumor Has It | 2020-12-10 08:19:00 |
HTML(xpn2020.tail(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
2123 | Foo Fighters | Everlong | 2020-12-17 00:08:00 |
2122 | Bob Marley & The Wailers | Three Little Birds | 2020-12-17 00:12:00 |
2121 | Pearl Jam | Alive | 2020-12-17 00:17:00 |
2120 | Joni Mitchell | Both Sides Now | 2020-12-17 00:23:00 |
2119 | Elton John | Funeral For A Friend/Love Lies Bleeding | 2020-12-17 00:30:00 |
HTML(xpn2020_onsies.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
123 | J. J. Cale | After Midnight | 2020-12-17 06:03:00 |
122 | Shocking Blue | Venus | 2020-12-17 06:05:00 |
121 | Ben Folds Five | Brick | 2020-12-17 06:10:00 |
120 | Sarah McLachlan | Sweet Surrender | 2020-12-17 06:14:00 |
119 | World Party | When The Rainbow Comes | 2020-12-17 06:22:00 |
HTML(xpn2020_onsies.tail(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
104 | Missy Elliott | Work It | 2020-12-17 07:32:00 |
103 | Wu-Tang Clan | C.R.E.A.M. (Cash Rules Everything Around Me) | 2020-12-17 07:36:00 |
102 | Robert Hazard | Escalator Of Life | 2020-12-17 07:42:00 |
101 | Rickie Lee Jones | Chuck E's In Love | 2020-12-17 07:45:00 |
100 | Taj Mahal | Ain't Gwine Whistle Dixie | 2020-12-17 07:52:00 |
HTML(xpn2020pt2.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
99 | The Doors | Light My Fire | 2020-12-17 08:02:00 |
98 | Miles Davis | All Blues | 2020-12-17 08:10:00 |
97 | John Prine | Hello In There | 2020-12-17 08:24:00 |
96 | Joni Mitchell | River | 2020-12-17 08:28:00 |
95 | Jason Isbell & The 400 Unit | If We Were Vampires | 2020-12-17 08:32:00 |
HTML(xpn2020pt2.tail(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
4 | Bruce Springsteen | Born To Run | 2020-12-17 18:15:00 |
3 | The Rolling Stones | Gimme Shelter | 2020-12-17 18:20:00 |
2 | Bob Dylan | Like A Rolling Stone | 2020-12-17 18:26:00 |
1 | John Lennon | Imagine | 2020-12-17 18:33:00 |
0 | Bruce Springsteen | Thunder Road | 2020-12-17 18:39:00 |
The 80's playlist started on Wednesday November 28 2018 at 8:00 am. As of this writing it just ended yesterday. However, something unusual happened this time: we took a break from 1am to 6am on 12-09. So it's easier to treat it as two playlists, and merge them after we calculate durations. The alternative is to allow passing in lists of breaks to duration calculation, if there were more breaks, we might.
eighties = fetch_playlist(datetime(2018, 11, 28, 8, 0), datetime(2018,12,9, 1, 0),
playlist_cache_dir)
print "got %d rows" % len(eighties)
got 3360 rows
eighties2 = fetch_playlist(datetime(2018, 12, 9, 6, 0), datetime(2018, 12, 9, 11, 49), playlist_cache_dir)
print "got %d rows" % len(eighties2)
got 71 rows
Before going any further, let's take a quick look at what we loaded:
HTML(eighties.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
209 | Warren Zevon | A Certain Girl | 2018-11-28 08:01:00 |
208 | U2 | A Day Without Me | 2018-11-28 08:04:00 |
207 | The Cure | A Forest | 2018-11-28 08:07:00 |
206 | The Waterboys | A Girl Called Johnny | 2018-11-28 08:13:00 |
205 | Romeo Void | A Girl in Trouble (Is a Temporary Thing) | 2018-11-28 08:18:00 |
HTML(eighties.tail(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
3613 | Crowded House | World Where You Live | 2018-12-09 00:38:00 |
3612 | Captain Sensible | Wot | 2018-12-09 00:41:00 |
3611 | Eurythmics | Would I Lie To You? | 2018-12-09 00:44:00 |
3610 | Nik Kershaw | Wouldn't It Be Good 12" | 2018-12-09 00:49:00 |
3609 | Black Flag | Wound Up | 2018-12-09 00:56:00 |
HTML(eighties2.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
160 | Ringo Starr | Wrack My Brain | 2018-12-09 06:01:00 |
159 | The Fabulous Thunderbirds | Wrap It Up | 2018-12-09 06:03:00 |
158 | The Police | Wrapped Around Your Finger | 2018-12-09 06:06:00 |
157 | Bruce Springsteen | Wreck On The Highway | 2018-12-09 06:12:00 |
156 | Neil Young | Wrecking Ball | 2018-12-09 06:16:00 |
HTML(eighties.describe(include='all', percentiles=[]).to_html(na_rep=''))
Artist | Title | Air Time | |
---|---|---|---|
count | 3360 | 3360 | 3360 |
unique | 1088 | 3288 | 3360 |
top | Bruce Springsteen | Heartbeat | 2018-12-08 04:08:00 |
freq | 50 | 3 | 1 |
first | 2018-11-28 08:01:00 | ||
last | 2018-12-09 00:56:00 |
In 2016, there was a follow on "leftovers" list for parentheticals, numbers and other random non-alphabeticals. In 2018, the playlist transitioned right into the leftovers. But since it doesn't align with any of the comparisons, I'm going to treat it separately for now.
eighties_leftovers = fetch_playlist(datetime(2018, 12, 9, 11, 50), datetime(2018, 12, 9, 15, 0),
playlist_cache_dir)
print "got %d rows" % len(eighties_leftovers)
got 42 rows
HTML(eighties_leftovers.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
89 | Minutemen | #1 Hit Song | 2018-12-09 11:50:00 |
88 | The Blow Monkeys with Curtis Mayfield | (Celebrate) The Day After You | 2018-12-09 11:52:00 |
87 | Ministry | (Every Day Is) Halloween | 2018-12-09 12:00:00 |
86 | Cutting Crew | (I Just) Died in Your Arms | 2018-12-09 12:07:00 |
85 | Joan Armatrading | (I Love It When You) Call Me Names | 2018-12-09 12:12:00 |
HTML(eighties_leftovers.describe(include='all', percentiles=[]).to_html(na_rep=''))
Artist | Title | Air Time | |
---|---|---|---|
count | 42 | 42 | 42 |
unique | 41 | 42 | 42 |
top | U2 | (It's Not Me) Talking | 2018-12-09 13:51:00 |
freq | 2 | 1 | 1 |
first | 2018-12-09 11:50:00 | ||
last | 2018-12-09 14:57:00 |
The 70s AtoZ started at 6:00 am on Nov 29 2017, and ended at 7:00 pm on Dec 12 2017.
seventies = fetch_playlist(datetime(2017, 11, 29, 6, 0), datetime(2017, 12, 12, 19, 0), playlist_cache_dir)
# Cover what looks like a Free at Noon slid into the play list
seventies = seventies[seventies['Title'] != 'The Runner']
print "got %d rows" % len(seventies)
got 4157 rows
HTML(seventies.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
219 | Steeleye Span | A Calling-On Song | 2017-11-29 06:02:00 |
218 | Joni Mitchell | A Case Of You | 2017-11-29 06:03:00 |
217 | Boz Scaggs | A Clue | 2017-11-29 06:07:00 |
216 | Todd Rundgren | A Dream Goes On Forever | 2017-11-29 06:13:00 |
215 | Lou Reed | A Gift | 2017-11-29 06:16:00 |
HTML(seventies.describe(include='all', percentiles=[]).to_html(na_rep=''))
Artist | Title | Air Time | |
---|---|---|---|
count | 4157 | 4157 | 4157 |
unique | 1028 | 4000 | 4154 |
top | David Bowie | She's Gone | 2017-12-10 23:17:00 |
freq | 63 | 3 | 2 |
first | 2017-11-29 06:02:00 | ||
last | 2017-12-12 18:54:00 |
The original A-Z playlist ran in 2016 from November 30 at 6:00 am until December 17 at 1:30 pm.
originals = fetch_playlist(datetime(2016, 11, 30, 6, 0), datetime(2016, 12, 17, 13, 30), playlist_cache_dir)
print "got %d rows" % len(originals)
got 5691 rows
HTML(originals.head(5).to_html())
Artist | Title | Air Time | |
---|---|---|---|
245 | Jackson 5 | ABC | 2016-11-30 06:01:00 |
244 | Elvis Presley | A Big Hunk O' Love | 2016-11-30 06:04:00 |
243 | Johnny Cash | A Boy Named Sue (live) | 2016-11-30 06:06:00 |
242 | Joni Mitchell | A Case Of You | 2016-11-30 06:10:00 |
241 | Ernie K-Doe | A Certain Girl | 2016-11-30 06:16:00 |
HTML(originals.describe(include='all', percentiles=[]).to_html(na_rep=''))
Artist | Title | Air Time | |
---|---|---|---|
count | 5691 | 5691 | 5691 |
unique | 1658 | 5294 | 5687 |
top | The Beatles | Hold On | 2016-12-10 17:46:00 |
freq | 141 | 5 | 2 |
first | 2016-11-30 06:01:00 | ||
last | 2016-12-17 13:25:00 |
For the original, 70s, and 80s A-Z, but not the A-Z leftovers, the station put up countdown specific pages with play lists in a slightly different format. One advantage of using them is that they only include tracks from the countdown, avoiding any need for time checking the date range. Another is that for the 70s A-Z, they added lists by year. Given the pain it was to search MusicBrainz for songs and figure out the year, that's worth having.
Now that I've moved to the main playlist, I don't know that the alphabetical lists buy much. Getting the first letter ourselves is pretty easy. But since older versions of the code used it, we'll at least archive them
#alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
#
#seventies_by_letter = pd.DataFrame(None, columns = ['Title', 'Artist', 'Letter'])
#for letter in alphabet:
# cache_file = path.join(a2z70s_cache_dir, '%s.csv' % letter)
# if path.exists(cache_file):
# df = pd.read_csv(cache_file)
# else:
# rows = []
# page = requests.get('http://xpn.org/static/az2017.php?q=%s' % letter)
# tree = html.fromstring(page.content)
# songs = tree.xpath('//li/text()')
# for song in songs:
# rows.append(song.rsplit(' - ', 1) + [letter])
# df = pd.DataFrame(rows, columns=['Title', 'Artist', 'Letter'])
# df.to_csv(cache_file, index=False)
# seventies_by_letter = seventies_by_letter.append(df, ignore_index=True)
#
#print "got %d songs by letter" % len(seventies_by_letter)
# was 4202 before commenting out
#years = map(str, range(1970,1980))
#seventies_by_year = pd.DataFrame(None, columns = ['Title', 'Artist', 'Year'])
#for year in years:
# cache_file = path.join(a2z70s_cache_dir, '%s.csv' % year)
# if path.exists(cache_file):
# df = pd.read_csv(cache_file)
# else:
# rows = []
# page = requests.get('http://xpn.org/static/az2017v2.php?q=%s' % year)
# tree = html.fromstring(page.content)
# songs = tree.xpath('//li/text()')
# for song in songs:
# rows.append(song.rsplit(' - ', 1) + [year])
# df = pd.DataFrame(rows, columns=['Title', 'Artist', 'Year'])
# df.to_csv(cache_file, index=False)
# seventies_by_year = seventies_by_year.append(df, ignore_index=True)
#
#seventies_by_year.to_csv(path.join(data_dir, 'seventies_by_year.csv'))
#print 'got %d songs by year' % len(seventies_by_year)
# was 3699 before comment out
Before the A-Z countdowns, there used to be the "885" countdowns. Each year had a theme. 2014's theme was "All Time Greatest and Worst Songs", where there was the traditional 885 countdown for best and a side 88 Worst list. As people comment on what got included in the A-Z countdowns, which are curated by the station, it's fun to compare against the best and worst which were based on listener voting.
def fetch_best(playlist_url, pagecount):
    """
    Fetch data from the 885 best or 88 worst playlists.

    Both countdowns share one page format; they differ only in the
    base url and how many pages there are.

    Args:
        playlist_url (string) : base url for the playlist
        pagecount (int) : number of pages to get
    Returns:
        DataFrame containing the track data
    """
    entries = []
    for current_page in range(1, pagecount + 1):
        response = requests.get(playlist_url, params={'page': current_page})
        doc = html.fromstring(response.content)
        # each countdown entry is a table row of class 'countdown'
        for row in doc.xpath("//*/tr[@class='countdown']"):
            song_artist = row.xpath('./td[2]/text()')[0]
            song_title = row.xpath('./td[@class="song"]/text()')[0]
            entries.append([song_title, song_artist])
    return pd.DataFrame(entries, columns=['Title', 'Artist'])
# Load the 2014 "885 Best" countdown (18 pages), hitting the network
# only when there is no local CSV copy yet.
best885_file = path.join(data_dir, '885best.csv')
if not path.exists(best885_file):
    best885 = fetch_best('http://www.xpn.org/music-artist/885-countdown/2014/885-countdown-2014',18)
    best885.to_csv(best885_file, index=False)
else:
    best885 = pd.read_csv(best885_file)
HTML(best885.head(5).to_html())
Title | Artist | |
---|---|---|
0 | Thunder Road | Bruce Springsteen |
1 | Like A Rolling Stone | Bob Dylan |
2 | Imagine | John Lennon |
3 | A Day In The Life | The Beatles |
4 | Born To Run | Bruce Springsteen |
# Load the companion "88 Worst" list (2 pages), again preferring the
# local CSV copy over a network fetch.
worst88_file = path.join(data_dir, '88worst.csv')
if not path.exists(worst88_file):
    worst88 = fetch_best('http://www.xpn.org/music-artist/885-countdown/2014/885-countdown-2014-88-worst',2)
    worst88.to_csv(worst88_file, index=False)
else:
    worst88 = pd.read_csv(worst88_file)
HTML(worst88.head(5).to_html())
Title | Artist | |
---|---|---|
0 | We Built This City | Starship |
1 | Who Let The Dogs Out | Baha Men |
2 | Achy Breaky Heart | Billy Ray Cyrus |
3 | (You're) Having My Baby | Paul Anka |
4 | Macarena | Los Del Rio |
One might think that we can just join up the data. However there is a catch. There are some cases where one or more of the URLs will return legitimate duplicates. For example two entries for the same song / artist at the same time in the main playlist page. However there are also valid entries for the same song / artist, at different times, released in different years. The catch is that there is no common key between our three sources to join on. If we dedupe on title and artist we drop real tracks. But doing a Cartesian product would generate 4 apparent tracks for two tracks. So we need to build an artificial key.
#seventies = seventies.sort_values(by='Air Time')
#seventies['Play'] = pd.Series([0 for x in range(len(seventies.index))], index=seventies.index)
#last = None
#count = 0
#for idx, row in seventies.iterrows():
# if last is None or last != (row['Title'], row['Artist']):
# last = (row['Title'], row['Artist'])
# count = 0
# else:
# count += 1
# seventies.loc[idx, 'Play'] = count
#
#seventies_by_letter = seventies_by_letter.drop_duplicates()
#
#seventies_by_year = seventies_by_year.sort_values(by=['Title', 'Artist'])
#seventies_by_year['Play'] = pd.Series([0 for x in range(len(seventies_by_year.index))], index=seventies_by_year.index)
#last = None
#count = 0
#for idx, row in seventies_by_year.iterrows():
# if last is None or last != (row['Title'], row['Artist']):
# last = (row['Title'], row['Artist'])
# count = 0
# else:
# count += 1
# seventies_by_year.loc[idx, 'Play'] = count
#
#seventies = seventies.merge(seventies_by_year, how='left', on=['Artist', 'Title', 'Play'])
#seventies = seventies.merge(seventies_by_letter, how='left', on=['Artist', 'Title'])
#seventies['Year'] = seventies['Year'].fillna(0.0).astype(int)
#seventies['Air Time'] = pd.to_datetime(seventies['Air Time'], errors='coerce')
For the moment, let's ignore the countdown specific pages. We likely need to resort to MusicBrainz for year of publication data for the 80s countdown.
And first letter is pretty easy. Well nothing is ever 100% easy. I've seen leading spaces (could be stripped during initial load) and words that start with leading apostrophes such as 'Til. So we need to scan past any non-alphabetics.
def first_char(s):
    """
    Return the first alphabetic character of s, upper-cased.

    Skips leading spaces and punctuation such as the apostrophe in 'Til.
    Falls back to the literal first character when the string contains
    no alphabetic characters at all (e.g. a title like "1999").
    """
    for c in s:
        # Fix: the old `type(c) is str` guard skipped unicode characters
        # under Python 2, so some titles fell through to the raw fallback.
        # Iterating a string always yields one-character strings.
        if c.isalpha():
            return c.upper()
    return s[0]
# Add a 'Letter' column to every countdown frame: the first alphabetic
# character of the Title (positional column 1 of each frame).
originals = originals.join(originals.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
seventies = seventies.join(seventies.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
eighties = eighties.join(eighties.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
eighties2 = eighties2.join(eighties2.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
xpn2020 = xpn2020.join(xpn2020.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
xpn2020_onsies = xpn2020_onsies.join(xpn2020_onsies.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
xpn2020pt2 = xpn2020pt2.join(xpn2020pt2.apply(lambda x: first_char(x[1]), axis=1).to_frame('Letter'))
For the non-alphabetic leftovers, we'll do first character instead, so no skipping past non-alphabetics.
# The leftovers list is all non-alphabetic titles, so take the literal
# first character rather than skipping to the first letter.
eighties_leftovers = eighties_leftovers.join(eighties_leftovers.apply(lambda x: x[1][0].upper(), axis=1).to_frame('First Character'))
from nltk.tokenize import RegexpTokenizer
# Tokenizer keeps apostrophes and hyphens inside words ("Don't", "T-N-T")
# but splits off any other punctuation as its own token.
custom_tokenize = RegexpTokenizer("[\w'\-]+|[^\w'\s\-]").tokenize
# Add a 'First Word' column: the first token of the Title.
originals = originals.join(originals.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
seventies = seventies.join(seventies.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
eighties = eighties.join(eighties.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
eighties2 = eighties2.join(eighties2.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
xpn2020 = xpn2020.join(xpn2020.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
xpn2020_onsies = xpn2020_onsies.join(xpn2020_onsies.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
xpn2020pt2 = xpn2020pt2.join(xpn2020pt2.apply(lambda x: custom_tokenize(x[1])[0], axis=1).to_frame('First Word'))
Since we have air times, we can approximate durations by subtracting the air time from the next track's air times. There are a couple catches with this
def estimate_durations(playlist, end_time=None):
    """
    Estimate the song durations by differencing consecutive air times.

    Catches handled:
    - a track in the Friday 11 o'clock hour followed by one in the 12
      hour is assumed to run into Free at Noon, and is truncated at the
      top of the hour rather than absorbing the whole program;
    - the final track has no successor, so its duration comes from
      end_time when supplied.

    Args:
        playlist (DataFrame): playlist with minimally an 'Air Time' attribute
        end_time (datetime): end time of the play list, or None if still going
    Return:
        modified DataFrame with 'Duration' attribute added (whole minutes).
    """
    playlist['Duration'] = pd.Series([0 for x in range(len(playlist.index))],
                                     index=playlist.index)
    previous = None
    last_idx = None
    for idx, row in playlist.iterrows():
        if previous is not None:
            if (row['Air Time'].date().weekday() == 4
                    and previous.hour == 11 and row['Air Time'].hour == 12):
                # We just fell into a free at noon
                playlist.loc[last_idx, 'Duration'] = 60 - previous.minute
            else:
                # just subtract this start from the previous
                delta = row['Air Time'] - previous
                # floor division keeps whole-minute (Python 2 style) results
                playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
        previous = row['Air Time']
        last_idx = idx
    # fixup the last row
    # Fix: guard against an empty playlist, where last_idx is still None
    if end_time is not None and last_idx is not None:
        delta = end_time - playlist.loc[last_idx, 'Air Time']
        playlist.loc[last_idx, 'Duration'] = delta.seconds // 60
    return playlist
# Estimate durations for every countdown, using each list's known end time.
# Fix: the first result was assigned to a misspelled 'orginals', so the
# 'originals' frame itself was never rebound to the annotated result.
originals = estimate_durations(originals, datetime(2016, 12, 17, 13, 30))
seventies = estimate_durations(seventies, datetime(2017, 12, 12, 19, 0))
eighties = estimate_durations(eighties, datetime(2018, 12, 9, 1, 0))
eighties2 = estimate_durations(eighties2, datetime(2018, 12, 9, 11, 49))
eighties_leftovers = estimate_durations(eighties_leftovers, datetime(2018, 12, 9, 15, 0))
xpn2020 = estimate_durations(xpn2020, datetime(2020, 12, 17, 0, 41))
xpn2020_onsies = estimate_durations(xpn2020_onsies, datetime(2020, 12, 17, 8, 0))
xpn2020pt2 = estimate_durations(xpn2020pt2, datetime(2020, 12, 17, 18, 46))
And now we can concatenate the 80s back into one data frame. And the same for 2020.
# Now that per-segment durations are computed, stitch the split
# countdowns back into single frames.
eighties = pd.concat([eighties, eighties2])
xpn2020 = pd.concat([xpn2020, xpn2020pt2])
And fix up the one remaining implausible duration. I'm going to assume that no 24 minute cut of Third World's You're Playing Us Too Close exists. The longest I can find is 7 minutes. Odds are we're still missing a couple tracks from where the playlist feed died about that time.
# Manual fixup: cap the one implausible gap-derived duration at 7 minutes.
eighties.loc[eighties['Title'] == "You're Playing Us Too Close", 'Duration' ] = 7
MusicBrainz is an free online music database, with an external XML Web-service that is supported in Python via the musicbrainzngs library. I'd originally used it to get publication year for the 2016 countdown, but abandoned it in 2017 since the 2017 playlist page had lists by year. Since there's no year data on the 2018 playlist, I'm bringing it back.
There are a couple of potential issues with querying MusicBrainz
So we'll get what we can programmatically via Musicbrainz. Then we'll look up the outliers manually, using some combination of Discogs and random stuff we find on Google, and prefill the cache file manually for those. For some really deep cuts, I've resorted to reading the date off of pictures of 45s for sale on EBay. No one answer works, it's ugly, but sometimes so is the recording industry.
One consequence is that we'll always lag on publication year data during the running of the playlists.
def add_musicbrainz_data(playlist, min_year = 1900, cache_file = None):
    """
    Add data from the musicbrainz database. Currently just first year of publication.
    The input data frame should contain at least Title and Artist fields
    and the resulting dataframe will have a new Year field.
    The cache file if used, should have been generated by a previous run of
    this function.
    Using a cache is strongly encouraged,
    as the MusicBrainz search interface is rate limited to one search per second
    so this can be very slow for large playlists.
    Args:
        playlist (Dataframe) : playlist to update
        min_year (int) : miminum year to consider
        cache_file (string) : path to cache file
    Returns:
        Dataframe containing the augmented playlist
    """
    import musicbrainzngs as mb
    # MusicBrainz requires an identifying user agent on every client
    mb.set_useragent('xpn-a2z', '0.1','https://github.com/asudell/a2z')
    # keep a list of artists named differently
    # at MusicBrainz than XPN, so we can 'fix' them
    artist_names = {
        "R. E. M.": "REM",
        "Run-DMC": "Run-D.M.C.",
        "The Ramones": "Ramones"
    }
    # load the cache if we have one
    if cache_file is not None and path.exists(cache_file):
        years = pd.read_csv(cache_file, encoding='utf-8')
        years = years.drop_duplicates()
    else:
        years = pd.DataFrame(None, columns=('Title','Artist', 'Year'))
    # left-join on Title/Artist; rows with no cached year get Year == NaN
    augmented = playlist.merge(years, how = 'left')
    # Lookup any unaugmented rows
    new_mb_rows = []
    for index, row in augmented[augmented['Year'].isnull()].iterrows():
        # translate to the MusicBrainz spelling of the artist if needed
        if row['Artist'] in artist_names:
            artist = artist_names[row['Artist']]
        else:
            artist = row['Artist']
        result = mb.search_recordings(row['Title'],
                                      artist = artist,
                                      status = 'official',
                                      strict = True,
                                      limit = 25)
        # take the earliest plausible release year across all matches
        rel_year = None
        for recording in result['recording-list']:
            if recording['release-list']:
                for release in recording['release-list']:
                    if 'date' in release and len(release['date']) > 0:
                        y = int(release['date'].split('-')[0])
                        if rel_year is None or rel_year > y:
                            if y >= min_year:
                                # assume years before 1900 are typos
                                rel_year = y
        if rel_year is not None:
            # record under the XPN artist name so the merge keys match
            new_mb_rows.append([row['Title'], row['Artist'], rel_year])
    new_years = pd.DataFrame(new_mb_rows, columns=('Title','Artist', 'Year'))
    # if we found new data, resave the cache and rebuild the augmented data
    if len(new_years) > 0:
        years = years.append(new_years, ignore_index=True)
        years = years.drop_duplicates()
        if cache_file is not None:
            years.to_csv(cache_file, index=False, encoding='utf-8')
        augmented = playlist.merge(years, how = 'left')
    return augmented
# Augment the 2020 countdown with first-publication years.
xpn2020 = add_musicbrainz_data(xpn2020, 1920, path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'))
# save a copy of anything without a year for manual review
xpn2020_missing = xpn2020[xpn2020['Year'].isnull()][['Title', 'Artist']]
xpn2020_missing.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_need_years.csv'),
                       index=False, encoding='utf-8')
# NOTE(review): round-trips the cache CSV to normalize its formatting;
# unclear if still needed — confirm before removing.
mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'))
mb_cache.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_years.csv'), index=False, encoding='utf-8')
# Same year augmentation for the One Vote Wonders mini-list.
xpn2020_onsies = add_musicbrainz_data(xpn2020_onsies, 1920, path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'))
# save a copy of anything without a year for manual review
xpn2020_onsies_missing = xpn2020_onsies[xpn2020_onsies['Year'].isnull()][['Title', 'Artist']]
xpn2020_onsies_missing.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_need_years.csv'),
                              index=False, encoding='utf-8')
# NOTE(review): round-trips the cache CSV to normalize its formatting;
# unclear if still needed — confirm before removing.
mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'))
mb_cache.to_csv(path.join(musicbrainz_cache_dir, 'xpn2020_onsies_years.csv'), index=False, encoding='utf-8')
# Year augmentation for the 80s countdown, restricted to 1980-1989.
eighties = add_musicbrainz_data(eighties, 1980, path.join(musicbrainz_cache_dir, '80s_years.csv'))
# Some recordings get released a lot, toss anything outside the 80s
# and we pick them up for manual review
# (NaN years fail both comparisons, so they are left as-is)
for index, row in eighties.iterrows():
    if row['Year'] < 1980 or row['Year'] > 1989:
        eighties.at[index, 'Year'] = np.nan
# and save a copy of anything without a year for manual review
eighties_missing = eighties[eighties['Year'].isnull()][['Title', 'Artist']]
eighties_missing.to_csv(path.join(musicbrainz_cache_dir, '80s_need_years.csv'),
                        index=False, encoding='utf-8')
# finally, prune any out of range entries from the cache, as
# we will keep growing them and duplicating records on joins
mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, '80s_years.csv'))
mb_cache = mb_cache[(mb_cache['Year'] >= 1980) & (mb_cache['Year'] <= 1989)]
mb_cache.to_csv(path.join(musicbrainz_cache_dir, '80s_years.csv'), index=False, encoding='utf-8')
and the same for the leftovers ...
# Same 1980-1989 year augmentation for the leftovers list.
eighties_leftovers = add_musicbrainz_data(eighties_leftovers, 1980, path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'))
# Some recordings get released a lot, toss anything outside the 80s
# and we pick them up for manual review
# (NaN years fail both comparisons, so they are left as-is)
for index, row in eighties_leftovers.iterrows():
    if row['Year'] < 1980 or row['Year'] > 1989:
        eighties_leftovers.at[index, 'Year'] = np.nan
# and save a copy of anything without a year for manual review
eighties_leftovers_missing = eighties_leftovers[eighties_leftovers['Year'].isnull()][['Title', 'Artist']]
eighties_leftovers_missing.to_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_need_years.csv'),
                                  index=False, encoding='utf-8')
# finally, prune any out of range entries from the cache, as
# we will keep growing them and duplicating records on joins
mb_cache = pd.read_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'))
mb_cache = mb_cache[(mb_cache['Year'] >= 1980) & (mb_cache['Year'] <= 1989)]
mb_cache.to_csv(path.join(musicbrainz_cache_dir, '80s_leftovers_years.csv'), index=False, encoding='utf-8')
Just for reference the manual additions were
The ones I couldn't ever find good years for are left in
HTML(xpn2020.head(10).to_html())
Artist | Title | Air Time | Letter | First Word | Duration | Year | |
---|---|---|---|---|---|---|---|
0 | Booker T. & The MG's | Time Is Tight | 2020-12-10 08:02:00 | T | Time | 3 | 1980 |
1 | AC/DC | T.N.T. | 2020-12-10 08:05:00 | T | T | 6 | 1975 |
2 | Peter Frampton | Show Me the Way | 2020-12-10 08:11:00 | S | Show | 5 | 1975 |
3 | The Drifters | Under The Boardwalk | 2020-12-10 08:16:00 | U | Under | 3 | 1989 |
4 | Adele | Rumor Has It | 2020-12-10 08:19:00 | R | Rumor | 5 | 2011 |
5 | Smith | Baby It's You | 2020-12-10 08:24:00 | B | Baby | 4 | 1969 |
6 | Aretha Franklin | Call Me | 2020-12-10 08:28:00 | C | Call | 3 | 1970 |
7 | Marvin Gaye & Kim Weston | It Takes Two | 2020-12-10 08:31:00 | I | It | 3 | 1966 |
8 | Curtis Mayfield | Superfly | 2020-12-10 08:34:00 | S | Superfly | 7 | 1973 |
9 | Shawn Colvin | I Don't Know Why | 2020-12-10 08:41:00 | I | I | 4 | 1992 |
HTML(xpn2020_onsies.head(10).to_html())
Artist | Title | Air Time | Letter | First Word | Duration | Year | |
---|---|---|---|---|---|---|---|
0 | J. J. Cale | After Midnight | 2020-12-17 06:03:00 | A | After | 2 | NaN |
1 | Shocking Blue | Venus | 2020-12-17 06:05:00 | V | Venus | 5 | 1969 |
2 | Ben Folds Five | Brick | 2020-12-17 06:10:00 | B | Brick | 4 | 1994 |
3 | Sarah McLachlan | Sweet Surrender | 2020-12-17 06:14:00 | S | Sweet | 8 | 1997 |
4 | World Party | When The Rainbow Comes | 2020-12-17 06:22:00 | W | When | 5 | 1990 |
5 | Suzanne Vega | Marlene On The Wall | 2020-12-17 06:27:00 | M | Marlene | 4 | 1985 |
6 | Big Joe Turner | Shake Rattle And Roll | 2020-12-17 06:31:00 | S | Shake | 3 | 1992 |
7 | Wilbert Harrison | Let's Work Together (Parts 1 & 2) | 2020-12-17 06:34:00 | L | Let's | 7 | 1994 |
8 | The Pretenders | Talk Of The Town | 2020-12-17 06:41:00 | T | Talk | 3 | 1980 |
9 | The Proclaimers | I'm Gonna Be (500 Miles) | 2020-12-17 06:44:00 | I | I'm | 7 | 1987 |
HTML(eighties.head(10).to_html())
Artist | Title | Air Time | Letter | First Word | Duration | Year | |
---|---|---|---|---|---|---|---|
0 | Warren Zevon | A Certain Girl | 2018-11-28 08:01:00 | A | A | 3 | 1980 |
1 | U2 | A Day Without Me | 2018-11-28 08:04:00 | A | A | 3 | 1980 |
2 | The Cure | A Forest | 2018-11-28 08:07:00 | A | A | 6 | 1980 |
3 | The Waterboys | A Girl Called Johnny | 2018-11-28 08:13:00 | A | A | 5 | 1983 |
4 | Romeo Void | A Girl in Trouble (Is a Temporary Thing) | 2018-11-28 08:18:00 | A | A | 7 | 1984 |
5 | The Smithereens | A Girl Like You | 2018-11-28 08:25:00 | A | A | 4 | 1989 |
6 | Albert Collins | A Good Fool Is Hard To Find | 2018-11-28 08:29:00 | A | A | 4 | 1986 |
7 | Phil Collins | A Groovy Kind Of Love | 2018-11-28 08:33:00 | A | A | 5 | 1988 |
8 | The Weirdos | A Life Of Crime | 2018-11-28 08:38:00 | A | A | 15 | 1985 |
9 | Erasure | A Little Respect | 2018-11-28 08:53:00 | A | A | 2 | 1988 |
HTML(eighties_leftovers.describe(include='all', percentiles=[]).to_html(na_rep=''))
Artist | Title | Air Time | First Character | Duration | Year | |
---|---|---|---|---|---|---|
count | 42 | 42 | 42 | 42 | 42.000000 | 42.000000 |
unique | 41 | 42 | 42 | 12 | ||
top | U2 | (It's Not Me) Talking | 2018-12-09 13:51:00 | ( | ||
freq | 2 | 1 | 1 | 15 | ||
first | 2018-12-09 11:50:00 | |||||
last | 2018-12-09 14:57:00 | |||||
mean | 4.523810 | 1984.142857 | ||||
std | 1.699901 | 2.824729 | ||||
min | 2.000000 | 1980.000000 | ||||
50% | 4.000000 | 1984.000000 | ||||
max | 9.000000 | 1989.000000 |
HTML(seventies.head(10).to_html())
Artist | Title | Air Time | Letter | First Word | Duration | |
---|---|---|---|---|---|---|
219 | Steeleye Span | A Calling-On Song | 2017-11-29 06:02:00 | A | A | 1 |
218 | Joni Mitchell | A Case Of You | 2017-11-29 06:03:00 | A | A | 4 |
217 | Boz Scaggs | A Clue | 2017-11-29 06:07:00 | A | A | 6 |
216 | Todd Rundgren | A Dream Goes On Forever | 2017-11-29 06:13:00 | A | A | 3 |
215 | Lou Reed | A Gift | 2017-11-29 06:16:00 | A | A | 7 |
214 | Poco | A Good Feelin' To Know | 2017-11-29 06:23:00 | A | A | 3 |
213 | Mac Davis | A Little Less Conversation | 2017-11-29 06:26:00 | A | A | 3 |
212 | Neil Young | A Man Needs A Maid | 2017-11-29 06:29:00 | A | A | 4 |
211 | Lou Rawls | A Natural Man | 2017-11-29 06:33:00 | A | A | 3 |
210 | David Bowie | A New Career In A New Town | 2017-11-29 06:36:00 | A | A | 5 |
HTML(originals.head(10).to_html())
Artist | Title | Air Time | Letter | First Word | Duration | |
---|---|---|---|---|---|---|
245 | Jackson 5 | ABC | 2016-11-30 06:01:00 | A | ABC | 3 |
244 | Elvis Presley | A Big Hunk O' Love | 2016-11-30 06:04:00 | A | A | 2 |
243 | Johnny Cash | A Boy Named Sue (live) | 2016-11-30 06:06:00 | A | A | 4 |
242 | Joni Mitchell | A Case Of You | 2016-11-30 06:10:00 | A | A | 6 |
241 | Ernie K-Doe | A Certain Girl | 2016-11-30 06:16:00 | A | A | 3 |
240 | Warren Zevon | A Certain Girl | 2016-11-30 06:19:00 | A | A | 5 |
239 | Sheryl Crow | A Change | 2016-11-30 06:24:00 | A | A | 4 |
238 | Sam Cooke | A Change Is Gonna Come | 2016-11-30 06:28:00 | A | A | 3 |
237 | The Beatles | A Day In The Life | 2016-11-30 06:31:00 | A | A | 5 |
236 | Ray Barretto | A Deeper Shade Of Soul | 2016-11-30 06:36:00 | A | A | 4 |
# Write the cleaned data sets out for later analysis. The commented-out
# lines are the equivalent exports from the earlier countdowns, kept for
# reference.
#originals_data_file = path.join(data_dir, 'A2Z.csv')
#originals.to_csv(originals_data_file, index=False)
#seventies_data_file = path.join(data_dir, '70sA2Z.csv')
#seventies.to_csv(seventies_data_file, index=False)
#eighties['Year'] = eighties['Year'].fillna(0).astype(int)
#eighties_data_file = path.join(data_dir, '80sA2Z.csv')
#eighties.to_csv(eighties_data_file, index=False, encoding='utf8')
#eighties_leftovers['Year'] = eighties_leftovers['Year'].fillna(0).astype(int)
#eighties_leftovers_data_file = path.join(data_dir, '80sLeftovers.csv')
#eighties_leftovers.to_csv(eighties_leftovers_data_file, index=False, encoding='utf8')

# CSV has no integer-NaN representation, so mark unknown years with 0
# before converting the column to int.
xpn2020['Year'] = xpn2020['Year'].fillna(0).astype(int)
xpn2020_data_file = path.join(data_dir, 'xpn2020.csv')
xpn2020.to_csv(xpn2020_data_file, index=False, encoding='utf8')

xpn2020_onsies['Year'] = xpn2020_onsies['Year'].fillna(0).astype(int)
xpn2020_onsies_data_file = path.join(data_dir, 'xpn2020_onsies.csv')
xpn2020_onsies.to_csv(xpn2020_onsies_data_file, index=False, encoding='utf8')