It's countdown time at XPN. I took a year off from doing stats last year, as the station did a bang-up job and I had other things going on, but I'm up for another crack at it.
This year it's the 90s A-Z countdown, which started at 8am on December 1st. There's a Real Time Countdown and the usual Twitter banter.
%matplotlib inline
from IPython.display import display, HTML
It's Thursday, Dec 8, and things have been running for over a week. We just finished Z at 8:30.
Most people don't care about how I get the data.
If you do, check out my Dataloading notebook.
If you just want to play with the data, the playlist, with some data augmentation, is here as 90sA2Z.csv.
If you do something interesting, let me know, and post with the #XPN90sAtoZ hashtag so people can find it.
import pandas as pd
from datetime import date, datetime, time, timedelta
from os import path
data_dir = './data'
playlist_file = path.join(data_dir, '90sA2Z.csv')
playlist = pd.read_csv(playlist_file)
playlist['Air Time'] = pd.to_datetime(playlist['Air Time'], errors='coerce')
last_play = playlist.loc[playlist['Air Time'].idxmax()]
end_time = last_play['Air Time'] + timedelta(seconds = 60 * last_play['Duration'])
HTML('<p>So far, as of %s, we have seen %d tracks with %d unique titles, from %d artists.</p>' %\
(end_time.strftime('%b %d %I:%M%p'),
len(playlist),
playlist.describe(include='all')['Title']['unique'],
playlist.describe(include='all')['Artist']['unique']
))
So far, as of Dec 08 08:27AM, we have seen 2133 tracks with 2083 unique titles, from 860 artists.
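For anyone playing along with the CSV, here is a minimal sketch of the same summary using `nunique()` instead of `describe()`. The sample rows and the `summarize` helper are made up for illustration; only the `Title` and `Artist` column names come from the playlist.

```python
import pandas as pd

# Made-up sample rows; the real data lives in 90sA2Z.csv.
sample = pd.DataFrame({
    'Title': ['Song One', 'Song One', 'Song Two', 'Song Three'],
    'Artist': ['Artist A', 'Artist B', 'Artist C', 'Artist D'],
})

def summarize(tracks):
    """Return (track count, unique titles, unique artists)."""
    return (len(tracks),
            int(tracks['Title'].nunique()),
            int(tracks['Artist'].nunique()))

total, titles, artists = summarize(sample)
print('%d tracks with %d unique titles, from %d artists' % (total, titles, artists))
```

Calling `nunique()` on a single column avoids running `describe(include='all')` over the whole frame twice.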
import seaborn as sns
import matplotlib.pyplot as plt
c = playlist['Artist'].value_counts()
artists = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('Artist', 'Count'))
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.barplot(y='Artist', x='Count', data=artists[artists['Count'] > 10], color='b')
ax.set(xlabel="Appearances in the Playlist (so far)")
I'd always wanted to group on albums. But using an external source like MusicBrainz to find / guess album names based on artist and title was kind of problematic. Now that album is in the new playlist format, this is easy.
albums = playlist.groupby(['Artist', 'Album']).size().reset_index().rename(columns={0:'Count'})
albums = albums.sort_values(by=['Count'], ascending=False)
albums["label"] = albums['Artist'] + ', ' + albums['Album']
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.barplot(y='label', x='Count', data=albums[albums['Count'] > 5], color='b')
ax.set(xlabel="Appearances in the Playlist (so far)")
ax.set(ylabel="Album")
If we are doing A to Z, it's probably good to know how far we are, and have some notion of where we're going.
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.countplot(y='Letter', data=playlist, color='b')
ax.set(xlabel="Songs Played", ylabel="First Letter")
Well, years are what a 90s countdown is about. I've had mixed results in earlier years trying to get release dates by looking for matches on MusicBrainz. So, right now I'm using the dates from the 90s A-Z playlist page. They may not be a lot better: we're only batting a bit over 90%. But here's what we got.
# print() with a single formatted string works on both Python 2 and Python 3
print("Of %d tracks, %d had valid dates and %d did not" %
      (len(playlist), len(playlist[playlist['Year'] > 0]), len(playlist[playlist['Year'] == 0])))
Of 2133 tracks, 1937 had valid dates and 196 did not
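Working from the counts printed above, 1937 of 2133 comes to about 90.8%. A small sketch of that calculation follows. The sample frame is made up, and it assumes, as the filter above does, that a `Year` of 0 marks a missing date.

```python
import pandas as pd

# Made-up sample; a Year of 0 is assumed to mean "no date found".
sample = pd.DataFrame({'Year': [1991, 0, 1994, 1999, 0]})

valid = int((sample['Year'] > 0).sum())
share = valid / float(len(sample))  # float() keeps Python 2 from truncating

print('%.1f%% of tracks have a usable year' % (100 * share))
```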
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.countplot(y='Year', data=playlist[playlist['Year'] > 0], color='b')
ax.set(xlabel="Songs Played", ylabel="Release Year")
Back in the original A-Z the first thing I did was break things down by popular first words.
c = playlist['First Word'].value_counts()
words = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('First Word', 'Count'))
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
sns.barplot(y='First Word', x='Count', data=words[words['Count'] > 10], color='b')
ax.set(xlabel="Tracks by First Word")
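The `First Word` column comes from the data augmentation step. If you only have titles, a rough sketch of deriving it might look like the following. This is an assumption about how such a column could be built, not the code from the Dataloading notebook, which may handle case or punctuation differently.

```python
import pandas as pd

# A few titles taken from the playlist, used only as sample input.
sample = pd.DataFrame({'Title': ['I Am A Scientist', 'I Am A Tree', 'These Days']})

# Split on whitespace and keep the first token of each title.
sample['First Word'] = sample['Title'].str.split().str[0]
counts = sample['First Word'].value_counts()
print(counts)
```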
Has an artist had a track played for each letter yet? Unlikely, but who comes the closest? This really relates to number of tracks played and, to some degree, diversity of naming. Similarly, has any artist had a track played for each year? That's an easier goal, but relies a lot on consistent output.
c = playlist[playlist.groupby('Artist')['Artist'].transform('size') >= 10]
c = c.groupby(['Artist', 'Letter']).count()
c.reset_index(level=[0,1], inplace=True)
c = c.pivot(index='Artist', columns='Letter', values='Title').fillna(0)  # keyword args; newer pandas rejects positional ones
f, ax = plt.subplots(figsize=(7, 8))
sns.set_color_codes('pastel')
ax = sns.heatmap(c, annot=True, cmap='PuBu')
ax.set(xlabel="Songs played by letter for Artists with 10 or more Songs (so far)")
c = playlist[playlist.groupby('Artist')['Artist'].transform('size') >= 10]
c = c[c['Year'] > 0]
c = c.groupby(['Artist', 'Year']).count()
c.reset_index(level=[0,1], inplace=True)
c = c.pivot(index='Artist', columns='Year', values='Title').fillna(0)  # keyword args; newer pandas rejects positional ones
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes('pastel')
ax = sns.heatmap(c, annot=True, cmap='PuBu')
ax.set(xlabel="Songs played by year for Artists with 10 or more Songs (so far)")
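The heatmaps show the spread, but the "who comes the closest?" question can also be answered with a single number per artist. A minimal sketch, using made-up rows and a hypothetical `coverage` helper, counts the distinct letters (or years) each artist has hit:

```python
import pandas as pd

# Made-up sample rows using the playlist's column names.
sample = pd.DataFrame({
    'Artist': ['Artist A', 'Artist A', 'Artist A', 'Artist B', 'Artist B'],
    'Letter': ['A', 'B', 'B', 'C', 'C'],
    'Year':   [1991, 1992, 0, 1995, 1995],
})

def coverage(tracks, column):
    """Count distinct values of `column` per artist, most covered first."""
    return (tracks.groupby('Artist')[column]
                  .nunique()
                  .sort_values(ascending=False))

letters = coverage(sample, 'Letter')
# Drop tracks without a usable year before counting years.
years = coverage(sample[sample['Year'] > 0], 'Year')
print(letters)
print(years)
```

An artist with a letter count of 26 would have covered the whole alphabet; for years, the target is 10.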
The duration we calculate is somewhat suspect. There are station ID breaks and such, so just counting start to start is iffy. But it's still interesting.
long_songs = playlist.sort_values(by='Duration', ascending=False)
f, ax = plt.subplots(figsize=(6, 6))
sns.set_color_codes('pastel')
sns.barplot(y='Title', x='Duration', data=long_songs.head(15), color='b')
ax.set(xlabel="Longest Songs in the Playlist (so far)")
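To make "start to start" concrete, here is a rough sketch with made-up air times. It is an illustration of the idea, not the code from the Dataloading notebook. It assumes `Duration` is in minutes, which matches how the end time is computed near the top of this notebook.

```python
import pandas as pd

# Made-up air times, for illustration only.
sample = pd.DataFrame({
    'Title': ['First', 'Second', 'Third'],
    'Air Time': pd.to_datetime(['2016-12-01 08:00',
                                '2016-12-01 08:04',
                                '2016-12-01 08:10']),
})

sample = sample.sort_values(by='Air Time')
# Gap between each track's start and the next track's start.
gap = sample['Air Time'].shift(-1) - sample['Air Time']
sample['Duration'] = gap.dt.total_seconds() / 60.0  # minutes
print(sample)
```

The last track has no following start, so its duration comes out as NaN. Any break between songs is folded into the earlier track, which is why the longest "songs" deserve some suspicion.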
How often do we get runs of two or more consecutive tracks by the same artist? It happens. The alphabet is a strange thing.
Note, I'm not sure that the same artist playing the same song really counts. But the two plays of All Apologies by Nirvana really happened. They were from In Utero and MTV Unplugged in New York. So a self cover?
def sequential_runs(tracks):
    """Find runs of two or more consecutive tracks by the same artist."""
    rows = []
    artist = None
    titles = []
    for idx in tracks.index:
        if tracks['Artist'][idx] == artist:
            titles.append(tracks['Title'][idx])
        else:
            if len(titles) > 1:
                rows.append({'Artist': artist, 'Songs': titles})
            artist = tracks['Artist'][idx]
            titles = [tracks['Title'][idx]]
    # Flush a run that is still open when the playlist ends.
    if len(titles) > 1:
        rows.append({'Artist': artist, 'Songs': titles})
    return pd.DataFrame(rows, columns=['Artist', 'Songs'])
HTML(sequential_runs(playlist).to_html(index=False))
Artist | Songs |
---|---|
Nirvana | [All Apologies, All Apologies] |
Gomez | [Get Miles, Get Myself Arrested] |
Pearl Jam | [Given to Fly, Glorified G] |
Guided By Voices | [I Am A Scientist, I Am A Tree] |
The Lemonheads | [It's A Shame About Ray, It's About Time] |
Ben Harper | [Mama's Got A Girlfriend, Mama's Trippin'] |
Wilco | [Outta Mind (Outta Sight), Outtasite (Outta Mi... |
Nirvana | [Pennyroyal Tea, Pennyroyal Tea] |
Los Lobos | [Reva's House, Revolution] |
10,000 Maniacs | [These Are Days, These Days] |
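The same runs can be found without an explicit loop. The sketch below is an alternative approach, not the function used above: it labels each run by comparing the artist to the previous row and taking a cumulative sum, then groups on that label. The sample rows and the `runs_by_group` name are made up.

```python
import pandas as pd

# Made-up sample rows in play order.
sample = pd.DataFrame({
    'Artist': ['X', 'Y', 'Y', 'Z', 'Z'],
    'Title':  ['a', 'b', 'c', 'd', 'e'],
})

def runs_by_group(tracks):
    # A new run starts whenever the artist differs from the previous row.
    changed = tracks['Artist'] != tracks['Artist'].shift()
    run_id = changed.cumsum().rename('run')
    grouped = tracks.groupby(run_id)
    runs = pd.DataFrame({
        'Artist': grouped['Artist'].first(),
        'Songs': grouped['Title'].apply(list),
    })
    # Keep only runs of two or more tracks.
    return runs[runs['Songs'].map(len) > 1].reset_index(drop=True)

result = runs_by_group(sample)
print(result)
```

Because every row belongs to some run label, a run that ends on the very last track is picked up without any special handling.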
Duplicate titles could be covers, or they could just be "same name, different song".
c = playlist['Title'].value_counts()
title_counts = pd.DataFrame(zip(c.keys().tolist(), c.tolist()),
columns=('Title', 'Count'))
f, ax = plt.subplots(figsize=(6, 4))
sns.set_color_codes('pastel')
sns.countplot(y='Count', data=title_counts[title_counts['Count'] > 1], color='b')
ax.set(xlabel="Number of Titles", ylabel="Number of Tracks with Title")  # countplot(y=...) puts the tally on the x axis
HTML(title_counts[title_counts['Count'] > 1].sort_values(by='Title').to_html(index=False))
Title | Count |
---|---|
All Apologies | 3 |
Alright | 2 |
Be Thankful For What You Got | 2 |
Believe | 2 |
Blue | 2 |
Changes | 2 |
Closing Time | 2 |
Congo Square | 2 |
Crazy | 2 |
Creep | 4 |
December | 2 |
Drown | 2 |
Early In The Morning | 2 |
Everyday Is A Winding Road | 2 |
Excuse Me Mr. | 2 |
Friend Of The Devil | 2 |
Hold On | 4 |
I Know | 2 |
I Need Love | 2 |
It's Alright | 2 |
Last Goodbye | 2 |
Late At Night | 2 |
Liar | 2 |
Maria | 2 |
Nothing Compares 2 U | 2 |
One | 2 |
Pennyroyal Tea | 2 |
Right Here, Right Now | 3 |
Ripple | 2 |
Sexuality | 2 |
Shine | 2 |
Show Me Love | 2 |
Spoonman | 2 |
Stars | 2 |
Stones In The Road | 2 |
Summertime | 2 |
Superstar | 2 |
Tennessee | 2 |
There She Goes | 2 |
Time Bomb | 2 |
Wild Horses | 2 |
You | 2 |
You And Your Sister | 2 |
You Bowed Down | 2 |
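The title counts alone can't separate the same artist playing a song twice (like the two Nirvana versions of All Apologies) from different artists sharing a title. A minimal sketch of one way to split them, using made-up rows, counts the distinct artists per title. It still can't tell a cover from an unrelated song with the same name.

```python
import pandas as pd

# Made-up sample rows using the playlist's column names.
sample = pd.DataFrame({
    'Title':  ['Song One', 'Song One', 'Song Two', 'Song Two', 'Song Three'],
    'Artist': ['Artist A', 'Artist A', 'Artist A', 'Artist B', 'Artist C'],
})

# For each title: how many plays, and how many distinct artists.
per_title = sample.groupby('Title')['Artist'].agg(['size', 'nunique'])
per_title.columns = ['Plays', 'Artists']

dupes = per_title[per_title['Plays'] > 1]
same_artist = dupes[dupes['Artists'] == 1].index.tolist()
multi_artist = dupes[dupes['Artists'] > 1].index.tolist()
print(same_artist, multi_artist)
```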
The code for this project is in my github repo and this file is specifically 90sA2Z.ipynb.
This project is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. You are free to use it for commercial or non-commercial purposes, so long as you attribute the source and also allow sharing.