GA DAT13 Final Presentation - Matt Lentz

November 19, 2014 - Updated December 7, 2014


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')

Project Overview

Music is one of the most definitive elements of popular culture. From the Beatles’ “Yesterday” in 1965 to 2014’s “All About That Bass” by Meghan Trainor, every generation is defined by the music to which it listened. But as generations age, what happens to the songs that defined them? How are they perceived by the incumbent ‘it’ generation, the millennials? What exactly makes a song popular? Using song metadata from the Billboard Top 40 over the past half century, along with machine learning algorithms, I will analyze the attrition of song popularity over the past 54 years and build my own 'genres' of popular music using unsupervised learning techniques.

In [2]:
songs = pd.read_csv(r'C:\Users\Matt\SkyDrive\Documents\GA Data Science\Final Project\song_genre.csv')
genres = pd.DataFrame(songs.genre.value_counts())
genre_list = ['rock','r&b','country','soul','hip hop','dance','rap','jazz','folk','disco','funk',
              'reggae','latin','house','electronic','blues','metal']
song_genre = songs[songs['genre'].isin(genre_list)]
song_genre = song_genre.drop_duplicates()
song_genre['date'] = pd.to_datetime(song_genre['date'],format='%m/%d/%Y')
genre_dummy = pd.get_dummies(song_genre['genre'])
song_date = song_genre['date']
song_date = genre_dummy.join(song_date)
song_group = song_date.groupby('date').agg('sum')

In [3]:
song_group.plot(kind='area',figsize=(30,18),colormap ='Paired')
plt.legend(loc=2,fontsize=15,ncol=3,markerscale=100)
plt.suptitle('Billboard Top 40 by Genre (Jan 1960 - Sep 2014)',fontsize=35)
plt.yticks(fontsize=20)
plt.xticks(fontsize=25)
plt.xlabel('date',fontsize=25)
Out[3]:
<matplotlib.text.Text at 0x109a0ef0>

The time series plot shows the evolution of genres in the Billboard Top 40 over the past half century. As the graph makes evident, rock music has dominated the list for most of its 54 years. Starting around 1990, the genres in the Top 40 begin to diversify with the emergence of R&B, hip hop, and rap. Today we see strong representation from country and rock, and even a resurgence of disco, a genre long left for dead at the beginning of the 1980s.


Data Collection

The data for this project was collected in three stages.

Stage 1: Webscrape Spotify URI codes from the Billboard Hot 100 list using Python's BeautifulSoup module.

Searching for the weekly top 40 songs from the past 50 years initially proved very difficult. On its webpage, Billboard only displays the top 10 songs for historic dates, and other sources would have been difficult to webscrape, with many line breaks and inconsistent spacing. In examining the HTML of the Billboard archive, I noticed that Spotify had created complete playlists for each week. This information was much easier to scrape and ultimately led to a seamless acquisition of song metadata for my dataset.

Billboard Hot 100

import urllib2
from datetime import datetime
from bs4 import BeautifulSoup

# Accumulators: chart position by date, and song metadata by Spotify URI
data = pd.DataFrame(columns=('date','rank','uri'))
data2 = pd.DataFrame(columns=('title','artist','popularity','uri'))


for year in range(1960,2015):
    for month in range(1,13):
        if month<10:
            month = str(0)+str(month)
        for day in range(1,32):
            if day < 10:
                day = str(0)+str(day)
            date = str(month)+'-'+str(day)+'-'+str(year)
            try:
                date1 = datetime.strptime(date, '%m-%d-%Y')
                # Billboard charts are dated on Saturdays (weekday() == 5)
                if datetime.weekday(date1) == 5:
                    try:            
                        link = urllib2.urlopen('http://www.billboard.com/charts/%s-%s-%s/hot-100' % (year,month,day))
                        soup = BeautifulSoup(link)
                        spotsoup1 = soup.findAll('a')
                        splitsoup1 = str(spotsoup1[58]).split('\"')
                        splitsoup2= splitsoup1[3].split(':')
                        splitsoup3 = splitsoup2[2].split(',')

                        for i in range(40):
                            dic = {}
                            dic['date'] = date
                            dic['rank'] = str(i+1)
                            dic['uri'] = str(splitsoup3[i])
                            data = data.append(dic,ignore_index=True)

                    except TypeError:
                        pass
            except ValueError:
                pass
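As an aside, the nested year/month/day loops above exist only to enumerate Saturdays (the `ValueError` branch silently discards invalid dates such as February 30). The same list of chart dates can be generated directly with pandas; this is a sketch of the alternative, not part of the original pipeline:

```python
import pandas as pd

# Every Saturday in the chart window; 'W-SAT' is the weekly-anchored
# frequency alias for Saturdays, matching the weekday() == 5 check above.
chart_dates = pd.date_range('1960-01-01', '2014-09-30', freq='W-SAT')

# Format to match the '%m-%d-%Y' strings used in the scraper URLs.
date_strings = [d.strftime('%m-%d-%Y') for d in chart_dates]
```

The first generated date, 01-02-1960, matches the earliest chart date visible in the dataset below, and no invalid dates ever need to be constructed and discarded.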

Stage 2: Perform API requests against Spotify's metadata API to pull in song title, artist, and current popularity.

import json
import urllib2

spoturl = 'http://ws.spotify.com/lookup/1/.json?uri=spotify:track:'

uri_list = data['uri'].unique()

for i in uri_list:
    dic2 = {}
    url = spoturl + str(i)
    response = urllib2.urlopen(url)
    json_object = json.load(response)
    dic2['title'] = json_object['track']['name']
    dic2['artist'] = json_object['track']['artists'][0]['name']
    dic2['popularity'] = json_object['track']['popularity']
    dic2['uri'] = i
    data2 = data2.append(dic2,ignore_index=True)

project_data = pd.merge(data,data2,on='uri')
project_data['rank'] = project_data['rank'].astype('int')
project_data['popularity'] = project_data['popularity'].astype('float')
project_data = project_data.sort(['date','rank'])  # sort returns a copy, so reassign
project_data.head()

Stage 3: Perform API requests against Echonest's API to retrieve song metadata, such as loudness, duration, and instrumentalness.

import json
import urllib2

echourl = 'http://developer.echonest.com/api/v4/track/profile?api_key=API&format=json&id=spotify:track:'
echourl2 = '&bucket=audio_summary'

features = ['danceability','duration','energy','instrumentalness','key','liveness','loudness','speechiness','tempo','time_signature']

# Accumulator frame for the Echonest audio features
data3 = pd.DataFrame(columns=features + ['uri'])

for i in uri_list:
    try:
        uri = i
        url = echourl+uri+echourl2
        dic3 = {}
        response = urllib2.urlopen(url)
        json_object = json.load(response)
        for f in features:
            dic3[f] = json_object['response']['track']['audio_summary'][f]
        dic3['uri'] = i
        data3 = data3.append(dic3,ignore_index=True)
    except KeyError:
        pass
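With over 11,000 unique URIs, a loop like this occasionally hits transient network errors or API rate limits. A hypothetical helper along these lines (not part of the original code, and `fetch_with_retry`, `opener`, `retries`, and `delay` are illustrative names) can make long pulls more robust:

```python
import time

def fetch_with_retry(url, opener, retries=3, delay=2.0):
    """Call opener(url), retrying a few times with a pause between
    attempts; re-raise the last error if all attempts fail."""
    for attempt in range(retries):
        try:
            return opener(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Wrapping each `urllib2.urlopen` call this way trades a small delay on failure for not having to restart a multi-hour pull from scratch.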

The Final Dataset

The dataset consists of 114,400 chart entries, comprising 11,114 unique songs from 3,428 artists and spanning 2,860 weeks.

In [4]:
songs = pd.read_csv(r'C:\Users\Matt\SkyDrive\Documents\GA Data Science\Final Project\song_data.csv')
songs.head()
Out[4]:
date rank uri title artist popularity danceability duration energy instrumentalness key liveness loudness speechiness tempo time_signature
0 1/2/1960 1 3hvakqVpwaz4L7zN5HfTCY Why Frankie Avalon 0.27 0.422345 155.23955 0.409041 5.260000e-09 5 0.112646 -8.54 0.026997 94.986 4
1 1/9/1960 2 3hvakqVpwaz4L7zN5HfTCY Why Frankie Avalon 0.27 0.422345 155.23955 0.409041 5.260000e-09 5 0.112646 -8.54 0.026997 94.986 4
2 1/16/1960 2 3hvakqVpwaz4L7zN5HfTCY Why Frankie Avalon 0.27 0.422345 155.23955 0.409041 5.260000e-09 5 0.112646 -8.54 0.026997 94.986 4
3 1/23/1960 2 3hvakqVpwaz4L7zN5HfTCY Why Frankie Avalon 0.27 0.422345 155.23955 0.409041 5.260000e-09 5 0.112646 -8.54 0.026997 94.986 4
4 1/30/1960 3 3hvakqVpwaz4L7zN5HfTCY Why Frankie Avalon 0.27 0.422345 155.23955 0.409041 5.260000e-09 5 0.112646 -8.54 0.026997 94.986 4
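The headline counts above can be checked directly against the frame; this small helper (illustrative, assuming the `uri`, `artist`, and `date` columns shown in the head() output) computes them:

```python
import pandas as pd

def chart_summary(df):
    """Headline counts for a chart frame with
    'uri', 'artist', and 'date' columns."""
    return {
        'entries': len(df),                  # total chart rows
        'songs': df['uri'].nunique(),        # unique songs
        'artists': df['artist'].nunique(),   # unique artists
        'weeks': df['date'].nunique(),       # distinct chart weeks
    }
```

Calling `chart_summary(songs)` on the loaded dataset should reproduce the figures quoted above (114,400 entries, 11,114 songs, 3,428 artists, 2,860 weeks).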

As the dataset was built mostly from API pulls, very little cleaning of the actual data was required.

In [5]:
pd.scatter_matrix(songs[['popularity','danceability','duration','tempo','loudness','energy']],figsize=(20,20))
plt.suptitle('Scatterplot Matrix of Song Attributes',size=25)
Out[5]:
<matplotlib.text.Text at 0x1099a828>