Well, it's December, and for the third year in a row it's time for the WXPN A-Z countdown. This time we are returning to the 1980s with the XPN 80's A-Z, and once again I'm taking a crack at doing statistical analysis along the way.
%matplotlib inline
from IPython.display import display, HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
Most people are not interested in how I pull and clean the data. If you are, feel free to look at my Data Loading notebook. If you just want a copy of the raw data, feel free to grab a copy of 80sA2Z.csv.
import pandas as pd
from datetime import date, datetime, time, timedelta
from os import path
data_dir = './data'
playlist_file = path.join(data_dir, '80sA2Z.csv')
playlist = pd.read_csv(playlist_file)
playlist['Air Time'] = pd.to_datetime(playlist['Air Time'], errors='coerce')
last_play = playlist.loc[playlist['Air Time'].idxmax()]
end_time = last_play['Air Time'] + timedelta(seconds = 60 * last_play['Duration'])
HTML('<p>So far, as of %s, we have seen %d tracks with %d unique titles, from %d artists.</p>' %
     (end_time.strftime('%b %d %I:%M%p'),
      len(playlist),
      playlist['Title'].nunique(),
      playlist['Artist'].nunique()))
So far, as of Dec 09 11:49AM, we have seen 3431 tracks with 3357 unique titles, from 1100 artists.
Note, there are no longer any data gaps, thanks to the WXPN data elves.
This was the first thing I looked at originally, and it serves as a good "how are we doing" metric during the countdown.
import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.countplot(y='Letter', data=playlist, color='b')
ax.set(xlabel="Songs Played", ylabel="First Letter")
c = playlist['Letter'].value_counts()
letters = pd.DataFrame(zip(c.keys().tolist(), c.tolist()), columns=('Letter', 'Count'))
letters_csv = path.join(data_dir, '80s_letters.csv')
letters.to_csv(letters_csv, index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (letters_csv, path.basename(letters_csv)))
The same data is available as 80s_letters.csv.
Given the decade orientation this time, it makes sense to break the songs down by year.
f, ax = plt.subplots(figsize=(6, 6))
sns.set_color_codes('pastel')
sns.countplot(y='Year', data=playlist[playlist['Year'] > 0], color='b')
ax.set(xlabel="Songs Played", ylabel="Year")
c = playlist['Year'].value_counts()
years = pd.DataFrame(zip(c.keys().tolist(), c.tolist()), columns=('Year', 'Count'))
years = years[years['Year'] > 0]
years_csv = path.join(data_dir, '80s_years.csv')
years.to_csv(years_csv, index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (years_csv, path.basename(years_csv)))
The same data is available as 80s_years.csv.
The original playlist was dominated by the Beatles. The 70s list was a bit more even, and it was kind of fun to watch the lead change over time. Now, with 20/20 hindsight, the 80s playlist was led from start to finish by Bruce Springsteen. But more interesting was Hüsker Dü, who had 22 tracks despite a short 5-year publishing history, of which only 4 years were used, as nothing from Everything Falls Apart made the list.
c = playlist['Artist'].value_counts()
artists = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('Artist', 'Count'))
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.barplot(y='Artist', x='Count', data=artists[artists['Count'] > 14], color='b')
ax.set(xlabel="Appearances in the Playlist (so far)")
artists_csv = path.join(data_dir, '80s_artists.csv')
artists.to_csv(artists_csv, index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (artists_csv, path.basename(artists_csv)))
The same data is available as 80s_artists.csv.
This isn't overly accurate, as it is based on the successive start times in the playlist, and those only have one-minute granularity. Plus, the hosts do pause to talk every few songs, and that isn't factored out. But it's the best we have. Besides, someone asked about this. Playing more songs helps with total duration, but so do long songs.
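As a rough illustration of how a duration can be derived from successive start times, here is a minimal sketch. The sample rows and timestamps are made up, and the real `Duration` column is built in the Data Loading notebook, which may well do it differently.

```python
import pandas as pd

# Hypothetical three-track sample; only the column names mirror the real playlist.
sample = pd.DataFrame({
    'Title': ['Song A', 'Song B', 'Song C'],
    'Air Time': pd.to_datetime(['2018-12-01 10:00',
                                '2018-12-01 10:04',
                                '2018-12-01 10:11']),
})
sample = sample.sort_values('Air Time').reset_index(drop=True)

# Duration of a track = start of the next track minus its own start, in minutes.
# Any host chatter between the two starts is silently counted as part of the song.
gap = sample['Air Time'].shift(-1) - sample['Air Time']
sample['Duration'] = gap.dt.total_seconds() / 60

# The last track has no successor, so its duration is unknown (NaN).
print(sample)
```

With one-minute start times, every estimate can be off by up to a minute in either direction, which is why the totals are only approximate.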
artist_durations = playlist.groupby('Artist')['Duration'].sum().to_frame()
artist_durations = artist_durations.reset_index()
artist_durations = artist_durations.sort_values(by='Duration', ascending=False)
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.barplot(y='Artist', x='Duration', data=artist_durations.head(20), color='b')
ax.set(xlabel="Total Time in the Playlist (so far)")
artist_durations_csv = path.join(data_dir, '80s_artist_durations.csv')
artist_durations.to_csv(artist_durations_csv, index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (artist_durations_csv, path.basename(artist_durations_csv)))
The same data is available as 80s_artist_durations.csv.
Has any artist played all the letters so far? If not, who came close? How about all the years? That takes fewer songs in the countdown, but places an onus on career longevity. When the smoke cleared, Springsteen, who had the most plays, had releases clustered in a few years. By contrast, Prince hit 9 of 10 years, and R.E.M. hit 8.
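The coverage question can also be answered with plain numbers rather than a heatmap. The following is a minimal sketch on a made-up miniature playlist; the artists and values are hypothetical, and only the column names match the real data.

```python
import pandas as pd

# Hypothetical miniature playlist using the same column names as the real one.
sample = pd.DataFrame({
    'Artist': ['Artist X', 'Artist X', 'Artist X', 'Artist Y', 'Artist Y'],
    'Letter': ['A', 'B', 'B', 'A', 'C'],
    'Year':   [1982, 1984, 1984, 1983, 1987],
})

# Count the distinct letters and distinct years each artist appears under.
coverage = sample.groupby('Artist')[['Letter', 'Year']].nunique()
coverage = coverage.sort_values(by='Year', ascending=False)
print(coverage)
```

On the real playlist, an artist with 10 in the `Year` column would have hit every year of the decade.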
c = playlist[playlist.groupby('Artist')['Artist'].transform('size') >= 14]
c = c.groupby(['Artist', 'Letter']).count()
c.reset_index(level=[0,1], inplace=True)
# keyword arguments are required by pivot in recent pandas releases
c = c.pivot(index='Artist', columns='Letter', values='Title').fillna(0)
f, ax = plt.subplots(figsize=(7, 8))
sns.set_color_codes('pastel')
ax = sns.heatmap(c, annot=True, cmap='PuBu')
ax.set(xlabel="Songs played by letter for Artists with 14 or more Songs (so far)")
c = playlist[playlist.groupby('Artist')['Artist'].transform('size') >= 14]
c = c[c['Year'] > 0]
c = c.groupby(['Artist', 'Year']).count()
c.reset_index(level=[0,1], inplace=True)
# keyword arguments are required by pivot in recent pandas releases
c = c.pivot(index='Artist', columns='Year', values='Title').fillna(0)
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
ax = sns.heatmap(c, annot=True, cmap='PuBu')
ax.set(xlabel="Songs played by year for Artists with 14 or more Songs (so far)")
Part of what started this last year was looking at first words. I didn't think that would be interesting, but when "I" took over the huge run of the letter I, I figured it was worth watching.
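For context, here is a minimal sketch of how a first word could be pulled from a title and tallied. This is an assumption on my part for illustration: the actual `First Word` column comes from the Data Loading notebook, and the titles below are made up.

```python
from collections import Counter

def first_word(title):
    """Return the first whitespace-separated word of a title, or '' if it is empty."""
    words = title.split()
    return words[0] if words else ''

# Hypothetical titles, not taken from the playlist.
titles = ['I Example One', 'I Example Two', 'In Example Three', 'Under Example Four']
counts = Counter(first_word(t) for t in titles)
print(counts.most_common(1))
```

Note that "I" and "In" count as different first words here even though both fall under the letter I.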
c = playlist['First Word'].value_counts()
words = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('First Word', 'Count'))
f, ax = plt.subplots(figsize=(6, 6))
sns.set_color_codes('pastel')
sns.barplot(y='First Word', x='Count', data=words.head(20), color='b')
ax.set(xlabel="Appearances in the Playlist (so far)")
first_words_csv = path.join(data_dir, '80s_first_words.csv')
words.head(50).to_csv(first_words_csv, columns=['First Word', 'Count'], index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (first_words_csv, path.basename(first_words_csv)))
The same data is available as 80s_first_words.csv.
We can look at which songs took the longest. During the playlist, sometimes missing data creates artificial "long tracks." Generally these get corrected, so what's here now seems legit. Of course, among the three 16-minute songs, there has to be a track by Phish.
long_songs = playlist.sort_values(by='Duration', ascending=False)
f, ax = plt.subplots(figsize=(6, 6))
sns.set_color_codes('pastel')
sns.barplot(y='Title', x='Duration', data=long_songs.head(20), color='b')
ax.set(xlabel="Longest Songs in the Playlist (so far)")
long_songs_csv = path.join(data_dir, '80s_long_songs.csv')
long_songs.head(20).to_csv(long_songs_csv, columns=['Artist', 'Title', 'Duration'], index=False)
HTML('<p>The same data is available as <a href="%s">%s</a>.</p>' % (long_songs_csv, path.basename(long_songs_csv)))
The same data is available as 80s_long_songs.csv.
Before the dawn of A-Z, XPN did year-end countdowns based on listener vote. In 2014 the main list was the 885 Best Songs of All Time. It had a parallel mini-list, the 88 Worst Songs of All Time.
So for everyone in Twitterdom who is complaining "why is song or anything by artist in this countdown?", let's see what the overlap is.
best885_file = path.join(data_dir, '885best.csv')
best885 = pd.read_csv(best885_file)
worst88_file = path.join(data_dir, '88worst.csv')
worst88 = pd.read_csv(worst88_file)
besties = pd.merge(playlist, best885, how='inner', on=['Title', 'Artist'])
besties.to_csv(path.join(data_dir, '80s_and_885Best.csv'), index=False)
horrors = pd.merge(playlist, worst88, how='inner', on=['Title', 'Artist'])
horrors.to_csv(path.join(data_dir, '80s_and_88Worst.csv'), index=False)
s = "<p>Of the %d tracks in the 80s A-Z so far, " + \
"%d or %0.2f%% were in 2014's 885 best playlist. " + \
"Those are available as <a href='data/80s_and_885Best.csv'>80s_and_885Best.csv</a>. " + \
"Sadly %d were in 2014's 88 worst playlist. " + \
"Those are available as <a href='data/80s_and_88Worst.csv'>80s_and_88Worst.csv</a>.</p>"
HTML(s %(len(playlist), len(besties), float(len(besties) * 100) / float(len(playlist)),
len(horrors)))
Of the 3431 tracks in the 80s A-Z so far, 69 or 2.01% were in 2014's 885 best playlist. Those are available as 80s_and_885Best.csv. Sadly 11 were in 2014's 88 worst playlist. Those are available as 80s_and_88Worst.csv.
So what were the songs that were part of 88Worst?
HTML(horrors.to_html(index=False, columns=['Title', 'Artist']))
Title | Artist |
---|---|
Come On Eileen | Dexy's Midnight Runners |
Don't Stop Believin' | Journey |
Livin' On A Prayer | Bon Jovi |
Mickey | Toni Basil |
Mr. Roboto | Styx |
Never Gonna Give You Up | Rick Astley |
Party All The Time | Eddie Murphy |
Sussudio | Phil Collins |
Uptown Girl | Billy Joel |
We Built This City | Starship |
We Didn't Start The Fire | Billy Joel |
In truth it's worse than that: two more tracks were on the 88Worst but were spelled differently there, so they did not match.
But keep in mind, a good number of the 88Worst were really "someone else's favorite song you got tired of hearing."
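Spelling mismatches like these are a general hazard of exact-match merges on `Title` and `Artist`. One way to soften them, sketched here under my own assumptions about what kind of differences occur (case, punctuation, accents, "&" versus "and"), is to build a loose matching key first. The `normalize` helper is hypothetical and is not part of the notebook.

```python
import re
import unicodedata

def normalize(text):
    """Reduce a title or artist name to a loose matching key."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, spell out '&', and delete apostrophes outright.
    text = text.casefold().replace('&', ' and ')
    text = re.sub(r"['\u2019]", '', text)
    # Turn every remaining run of non-alphanumerics into a single space.
    text = re.sub(r'[^a-z0-9]+', ' ', text)
    return ' '.join(text.split())

# Made-up spellings that differ only in punctuation and '&' vs 'and'.
print(normalize("Rock & Roll Song"))
print(normalize("Rock and Roll Song!"))
```

The keys could be stored in extra columns on both data frames and used as the `on` columns for `pd.merge`. A loose key can also produce false matches, so the results would still deserve a manual look.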
Last year there was a 70's A-Z Countdown, and in 2016 there was the original generic A-Z Countdown. We can compare the current playlist to them.
originals_file = path.join(data_dir, 'A2Z.csv')
originals = pd.read_csv(originals_file)
seventies_file = path.join(data_dir, '70sA2Z.csv')
seventies = pd.read_csv(seventies_file)
reruns = pd.merge(playlist, originals, how = 'inner', on = ['Title', 'Artist'])
reruns_file = path.join(data_dir, '80s_reruns.csv')
reruns.to_csv(reruns_file, index=False, encoding='utf-8')
s = """
<p>Of the %d tracks played so far, %d tracks or %0.2f%%
were played as part of the original playlist.
The list is in <a href='%s'>%s</a>.</p>
"""
HTML(s % (len(playlist), len(reruns), float(len(reruns) * 100)/float(len(playlist)),
reruns_file, path.basename(reruns_file)))
Of the 3431 tracks played so far, 856 tracks or 24.95% were played as part of the original playlist. The list is in 80s_reruns.csv.
playlist['List'] = '80s'
playlist['Count'] = 1
seventies['List'] = '70s'
seventies['Count'] = 1
originals['List'] = 'Originals'
originals['Count'] = 1
combined = pd.concat([originals, seventies, playlist])
# select multiple columns with a list; tuple-style selection was removed from pandas
summary = combined.groupby(['List', 'Letter'])[['Count', 'Duration']].sum()
summary = summary.reset_index()
sns.set_color_codes('pastel')
# factorplot was renamed catplot (and size became height) in seaborn 0.9; sns.plt is gone
sns.catplot(x='Letter', y='Count', hue='List', data=summary, kind='bar', height=6)
plt.title('Tracks Played by Letter Comparing Playlists')
sns.catplot(x='Letter', y='Duration', hue='List', data=summary, kind='bar', height=6)
plt.title('Air Time (min) by Letter Comparing Playlists')
Duplicate titles are not necessarily covers. Many are. Other times it is a case of "same title, different song." There have been a lot of titles shared by two songs, but only a few shared by more than two.
c = playlist['Title'].value_counts()
title_counts = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('Title', 'Count'))
f, ax = plt.subplots(figsize=(6, 4))
sns.set_color_codes('pastel')
sns.countplot(y='Count', data=title_counts[title_counts['Count'] > 1], color='b')
# countplot(y='Count') puts the per-title track count on the y axis and the tally of titles on x
ax.set(xlabel="Number of Titles", ylabel="Number of Tracks with Title")
HTML(title_counts[title_counts['Count'] > 2].sort_values(by='Title').to_html(index=False))
Title | Count |
---|---|
Desire | 3 |
Heartbeat | 3 |
Magic | 3 |
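Counting titles alone does not say whether the repeats come from different artists. A minimal sketch of narrowing that down follows, using made-up rows; it cannot tell a cover from a "same title, different song" case, it only flags titles that more than one artist recorded.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real playlist.
sample = pd.DataFrame({
    'Title':  ['Title One', 'Title One', 'Title Two', 'Title Two', 'Title Three'],
    'Artist': ['Artist X', 'Artist Y', 'Artist X', 'Artist X', 'Artist Z'],
})

# For each title: how many tracks carry it, and how many distinct artists played it.
per_title = sample.groupby('Title').agg(
    Tracks=('Artist', 'size'),
    Artists=('Artist', 'nunique'),
)

# Titles with more than one artist are the candidates for covers or name collisions.
shared = per_title[per_title['Artists'] > 1]
print(shared)
```

A title with several tracks but a single artist, like the hypothetical "Title Two" here, is more likely a live or alternate version of the same song.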
The code for this project is in my GitHub repo. The notebook itself is published on nbviewer.
This project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. You are free to use for commercial or non-commercial purposes, so long as you attribute the source and also allow sharing.