import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
Music is one of the most definitive elements of popular culture. From the Beatles’ “Yesterday” in 1965 to Meghan Trainor’s “All About That Bass” in 2014, every generation is defined by the music it listens to. But as generations age, what happens to the songs that defined them? How are they perceived by the incumbent ‘it’ generation, the millennials? And what exactly makes a song popular? Using song metadata from the Billboard Top 40 for the past half century along with machine learning algorithms, I will analyze the attrition of song popularity over the past 54 years and build my own 'genres' of popular music using unsupervised learning techniques.
songs = pd.read_csv(r'C:\Users\Matt\SkyDrive\Documents\GA Data Science\Final Project\song_genre.csv')
genres = pd.DataFrame(songs.genre.value_counts())
genre_list = ['rock','r&b','country','soul','hip hop','dance','rap','jazz','folk','disco','funk',
'reggae','latin','house','electronic','blues','metal']
song_genre = songs[songs['genre'].isin(genre_list)]
song_genre = song_genre.drop_duplicates()
song_genre['date'] = pd.to_datetime(song_genre['date'],format='%m/%d/%Y')
genre_dummy = pd.get_dummies(song_genre['genre'])
song_date = song_genre['date']
song_date = genre_dummy.join(song_date)
song_group = song_date.groupby('date').agg('sum')
song_group.plot(kind='area',figsize=(30,18),colormap ='Paired')
plt.legend(loc=2,fontsize=15,ncol=3,markerscale=100)
plt.suptitle('Billboard Top 40 by Genre (Jan 1960 - Sep 2014)',fontsize=35)
plt.yticks(fontsize=20)
plt.xticks(fontsize=25)
plt.xlabel('date',fontsize=25)
The time series plot shows the evolution of genres in the Billboard Top 40 over the past half century. Rock music has dominated the list for most of its 54 years of existence. Starting around 1990, the genres in the Top 40 begin to diversify with the emergence of R&B, hip hop, and rap. Today we see strong representation from country and rock, and even a resurgence of disco, a genre long left for dead at the start of the 1980s.
Searching for the weekly Top 40 songs from the past 50 years initially proved very difficult. On its webpage, Billboard only displays the top 10 songs for historic dates, and other sources proved difficult to scrape because of frequent line breaks and inconsistent spacing. While examining the HTML of the Billboard archive, I noticed that Spotify had created complete playlists for each week. That information was much easier to scrape and ultimately led to a seamless acquisition of song metadata for my dataset.
import json
import urllib2
from datetime import datetime
from bs4 import BeautifulSoup

data = pd.DataFrame(columns=('date','rank','uri'))
data2 = pd.DataFrame(columns=('title','artist','popularity','uri'))
# Billboard publishes a new Hot 100 chart every Saturday, so walk every
# calendar date from 1960 through 2014 and keep only the Saturdays.
for year in range(1960,2015):
    for month in range(1,13):
        if month < 10:
            month = str(0) + str(month)
        for day in range(1,32):
            if day < 10:
                day = str(0) + str(day)
            date = str(month) + '-' + str(day) + '-' + str(year)
            try:
                date1 = datetime.strptime(date, '%m-%d-%Y')
                if date1.weekday() == 5:  # Saturday
                    try:
                        link = urllib2.urlopen('http://www.billboard.com/charts/%s-%s-%s/hot-100' % (year,month,day))
                        soup = BeautifulSoup(link)
                        # The 59th anchor tag holds the embedded Spotify playlist;
                        # its URI attribute is a colon- and comma-delimited list of
                        # the week's track IDs.
                        spotsoup1 = soup.findAll('a')
                        splitsoup1 = str(spotsoup1[58]).split('"')
                        splitsoup2 = splitsoup1[3].split(':')
                        splitsoup3 = splitsoup2[2].split(',')
                        for i in range(40):
                            dic = {}
                            dic['date'] = date
                            dic['rank'] = str(i+1)
                            dic['uri'] = str(splitsoup3[i])
                            data = data.append(dic, ignore_index=True)
                    except TypeError:
                        pass
            except ValueError:
                # strptime raises ValueError on impossible dates (e.g. Feb 30)
                pass
spoturl = 'http://ws.spotify.com/lookup/1/.json?uri=spotify:track:'
uri_list = data['uri'].unique()
# Look up each unique track URI against the Spotify metadata API to get
# title, artist, and the current popularity score.
for i in uri_list:
    dic2 = {}
    url = spoturl + str(i)
    response = urllib2.urlopen(url)
    json_object = json.load(response)
    dic2['title'] = json_object['track']['name']
    dic2['artist'] = json_object['track']['artists'][0]['name']
    dic2['popularity'] = json_object['track']['popularity']
    dic2['uri'] = i
    data2 = data2.append(dic2, ignore_index=True)
project_data = pd.merge(data, data2, on='uri')
project_data['rank'] = project_data['rank'].astype('int')
project_data['popularity'] = project_data['popularity'].astype('float')
project_data = project_data.sort(['date','rank'])
project_data.head()
echourl = 'http://developer.echonest.com/api/v4/track/profile?api_key=API&format=json&id=spotify:track:'
echourl2 = '&bucket=audio_summary'
features = ['danceability','duration','energy','instrumentalness','key','liveness','loudness','speechiness','tempo','time_signature']
data3 = pd.DataFrame(columns=features + ['uri'])
# Pull the Echonest audio summary (danceability, tempo, etc.) for each track.
for i in uri_list:
    try:
        url = echourl + i + echourl2
        dic3 = {}
        response = urllib2.urlopen(url)
        json_object = json.load(response)
        for f in features:
            dic3[f] = json_object['response']['track']['audio_summary'][f]
        dic3['uri'] = i
        data3 = data3.append(dic3, ignore_index=True)
    except KeyError:
        # Some tracks have no audio summary; skip them.
        pass
The dataset consists of 114,400 rows covering 11,114 unique songs by 3,428 artists, spanning 2,860 weeks.
songs = pd.read_csv(r'C:\Users\Matt\SkyDrive\Documents\GA Data Science\Final Project\song_data.csv')
songs.head()
 | date | rank | uri | title | artist | popularity | danceability | duration | energy | instrumentalness | key | liveness | loudness | speechiness | tempo | time_signature
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1/2/1960 | 1 | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.54 | 0.026997 | 94.986 | 4 |
1 | 1/9/1960 | 2 | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.54 | 0.026997 | 94.986 | 4 |
2 | 1/16/1960 | 2 | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.54 | 0.026997 | 94.986 | 4 |
3 | 1/23/1960 | 2 | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.54 | 0.026997 | 94.986 | 4 |
4 | 1/30/1960 | 3 | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.54 | 0.026997 | 94.986 | 4 |
As the dataset was built mostly from API pulls, very little work was needed to clean the actual data.
pd.scatter_matrix(songs[['popularity','danceability','duration','tempo','loudness','energy']],figsize=(20,20))
plt.suptitle('Scatterplot Matrix of Song Attributes',size=25)
Despite the apparent randomness of the scatterplots, the histograms of the features were almost normally distributed. I found this distribution surprising, considering the weekly Top 40 is a seemingly arbitrary collection of songs with no real relation to one another.
The scatterplots also indicated that the data would contain a lot of noise, given the wide range of song attributes. The best way to analyze the data, I determined, would be through subsets.
As all of the songs in this dataset were among the most popular at one time, I wanted to analyze what effect time, as well as some of the metadata features, has on a song's popularity in 2014.
songs_1 = songs[songs['rank'] == 1]
songs_1['date'] = pd.to_datetime(songs_1['date'])
songs_1.plot(x='date',y='popularity',figsize=(25,8))
plt.suptitle('Time Series of Current Popularity of #1 Songs',size =25)
plt.ylabel('Current Popularity')
The time series plot shows that the current popularity of number one songs trends upward as we approach the present, with the pace accelerating around 1990. As technologically savvy millennials are among Spotify's largest customer bases, the hit songs of their lifetimes, even their early years, may still show up on playlists, providing a boost to the popularity metric.
The trend accelerates even more sharply around 2011, which marks Spotify’s deal with Facebook to turn music into a social experience. Songs from that year onward have a distinct advantage in the popularity metric, as they are more likely to benefit from the influx of new users generating more listens.
For the linear regression of the Echonest metadata features and elapsed time against current song popularity, I used an Ordinary Least Squares model, as my target was continuous. Moreover, as there was considerable variance in my data, I wanted to avoid the overfitting that a more flexible model such as a Random Forest regressor could introduce (Naive Bayes, being a classifier, was not an option for a continuous target).
To prepare the data for the model, I created a script that found each song’s highest rank on the weekly Top 40 and the latest date on which it achieved that ranking. I also created a variable recording the time elapsed since that date, in days.
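That script is not reproduced here, but its logic can be sketched in pandas roughly as follows. This is a hypothetical reconstruction, not the original script: the names tmp, best, and top_rank are mine, and the actual output was saved to the top_rank.csv file loaded below.
# Hypothetical reconstruction: for each track, find its best (lowest-numbered)
# chart position and the latest date on which it held that position.
tmp = songs[['uri','date','rank']].copy()
tmp['date'] = pd.to_datetime(tmp['date'])
best = tmp.groupby('uri')['rank'].min().reset_index()
best.columns = ['uri','best_rank']
tmp = pd.merge(tmp, best, on='uri')
top_rank = (tmp[tmp['rank'] == tmp['best_rank']]
            .groupby('uri')['date'].max().reset_index())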
top_songs = pd.read_csv(r'C:\Users\Matt\Desktop\Billboard_Top_40\Data\top_rank.csv')
songs2 = songs.copy()
songs2 = songs2.drop(['date','rank'],axis=1)
songs2 = songs2.drop_duplicates()
top_songs = pd.merge(songs2,top_songs,on='uri')
top_songs['date'] = pd.to_datetime(top_songs['date'])
top_songs['time'] = top_songs.date.max() - top_songs.date
top_songs['days'] = (top_songs.time /np.timedelta64(1, 'D')).astype(int)
from sklearn import linear_model
clf = linear_model.LinearRegression()
features = ['danceability','duration','energy','instrumentalness','key','liveness','loudness','tempo','time_signature','days']
X = top_songs[features]
X = X.values
y = top_songs['popularity']
y = y.values
To train and score my model, I used a train/test split, which, as the name implies, breaks the dataset into a training set and a testing set. The training set is fed to the model to 'teach' it the feature variations and their relationship to the target variable, while the test set simulates new data the model has not seen. The model is then scored on how well it performs; in the case of OLS, the score is an R-squared value.
from sklearn.cross_validation import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(X, y)
model = clf.fit(xtrain,ytrain)
print 'Train Score: ', model.score(xtrain,ytrain)
print 'Test Score: ', model.score(xtest,ytest)
Train Score:  0.342880893741
Test Score:  0.294135635247
Both the training and test scores indicate a relatively poor linear fit for the metadata features and elapsed time vs the target value of popularity. As expected, the model does not show signs of overfitting as the test and train score values are nearly identical.
print pd.DataFrame(zip(features,model.coef_.T),columns=['Variable','Coefficient'])
from sklearn import feature_selection
f = feature_selection.f_regression(X,y)
print
print pd.DataFrame(zip(features,f[1].T),columns=['Variable','P-Value'])
           Variable  Coefficient
0      danceability     0.002759
1          duration     0.000081
2            energy    -0.056066
3  instrumentalness    -0.087992
4               key    -0.001231
5          liveness    -0.053201
6          loudness     0.007141
7             tempo    -0.000077
8    time_signature     0.011024
9              days    -0.000018

           Variable        P-Value
0      danceability   2.247256e-49
1          duration  9.562488e-142
2            energy  2.802139e-114
3  instrumentalness   2.614871e-42
4               key   9.712986e-01
5          liveness   1.640046e-15
6          loudness  1.049723e-303
7             tempo   7.171618e-01
8    time_signature   5.298900e-21
9              days   0.000000e+00
Oddly, the coefficient for elapsed days is vanishingly small even though its P-value is essentially zero, and an earlier plot showed that elapsed time clearly impacts a song's current popularity. Meanwhile, key and tempo have P-values far above any reasonable significance threshold. These unexpected results may be symptoms of multicollinearity.
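One quick way to probe that suspicion is to look at how strongly each predictor correlates with elapsed days. This is a supplementary check, not part of the original analysis:
# Correlation of every regression feature with elapsed days; values far
# from zero flag predictors carrying overlapping information.
print top_songs[features].corr()['days']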
Given the extremely low P-value for loudness, let's examine the relationship between elapsed days and song loudness.
plt.scatter(x=top_songs.days,y=top_songs.loudness)
plt.ylabel('loudness (decibels)',size=10)
plt.xlabel('Elapsed Days (Max Top 40 Ranking to 9/2014)',size=10)
plt.suptitle('Loudness vs Days Elapsed')
plt.xticks(size=8)
plt.yticks(size=8)
There appears to be a negative relationship between elapsed days and loudness: the older the song, the quieter it tends to be. Given this correlation between two predictors, the relationship may be a source of the multicollinearity in the model.
I began removing features from the model until I was left with the most significant: elapsed days.
newer_feature = ['days']
X = top_songs[newer_feature]
X = X.values
y = top_songs['popularity']
y = y.values
xtrain,xtest,ytrain,ytest = train_test_split(X, y)
model = clf.fit(xtrain,ytrain)
print 'Train Score: ', model.score(xtrain,ytrain)
print 'Test Score: ', model.score(xtest,ytest)
Train Score:  0.342880893741
Test Score:  0.294135635247
The scores are essentially unchanged with just one feature. This result indicates that the Echonest metadata features, in an OLS regression model, have little influence in determining a song's current popularity.
As the OLS regression model proved ineffective at predicting continuous popularity values, a logistic regression model, in which the target (song popularity) becomes a discrete class, may prove a stronger fit. For this model, the data will be split at specific quantiles. First, the heavily skewed speechiness and instrumentalness features are log-transformed, and each transformed column (along with loudness) is shifted so its minimum is zero.
top_songs['log_speech'] = np.log(top_songs['speechiness'])
top_songs['log_speech'] = top_songs['log_speech'] - top_songs['log_speech'].min()
top_songs['log_instrument'] = np.log(top_songs['instrumentalness'])
top_songs['log_instrument'] = top_songs['log_instrument'] - top_songs['log_instrument'].min()
top_songs['loudness'] = top_songs['loudness'] - top_songs['loudness'].min()
The data will be divided into halves, with the top half being songs with popularities at or above the median popularity value for the dataset.
top_songs['med_pop'] = (top_songs['popularity'] >= top_songs['popularity'].median()).astype('int')
top_songs = top_songs.dropna()
features = ['danceability','duration','energy','key','liveness','loudness','tempo','time_signature','log_instrument','log_speech']
X = top_songs[features]
X = X.values
y = top_songs['med_pop']
y = y.values
xtrain,xtest,ytrain,ytest = train_test_split(X, y)
log = linear_model.LogisticRegression()
model = log.fit(xtrain,ytrain)
print 'Train Score: ', model.score(xtrain,ytrain)
print 'Test Score: ', model.score(xtest,ytest)
Train Score:  0.673605228125
Test Score:  0.69219600726
The logistic regression classifies roughly 69% of the test songs correctly, a marked improvement over the weak fit of the OLS regression, and comfortably above the 50% baseline of guessing on a median split.
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = top_songs.boxplot(features[i],by='med_pop',ax=plt.subplot(6,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Feature Boxplots by Median Quantile',size=20)
According to the boxplots, songs in the top median quantile of popularity demonstrate a slightly higher level of verbosity, energy, and loudness than songs in the lower median quantile.
The data will be divided into 4 groups along quartiles.
def quartile(pop):
    # Bucket a 0-1 popularity score into four fixed bins.
    if pop < .25:
        return 1
    elif .25 <= pop < .5:
        return 2
    elif .5 <= pop < .75:
        return 3
    else:
        return 4
top_songs['quart_pop'] = top_songs['popularity'].apply(quartile).astype('int')
features = ['danceability','duration','energy','key','liveness','loudness','tempo','time_signature','log_instrument','log_speech']
X = top_songs[features]
X = X.values
y = top_songs['quart_pop']
y = y.values
xtrain,xtest,ytrain,ytest = train_test_split(X, y)
model = log.fit(xtrain,ytrain)
print 'Train Score: ', model.score(xtrain,ytrain)
print 'Test Score: ', model.score(xtest,ytest)
Train Score:  0.505748517488
Test Score:  0.503811252269
Unsurprisingly, the model performs worse as more classification categories are added. However, the model still scores slightly above 50%, meaning there is some relationship between the Echonest metadata features and the popularity quartiles.
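One caveat, my observation rather than the notebook's: because the quartile function above uses fixed popularity cutoffs rather than the data's own quantiles, the four classes are not guaranteed to be balanced, so the fair baseline is the majority-class share rather than a flat 25%:
# Baseline check: always guessing the most common quartile scores
# whatever the largest share printed here is.
counts = top_songs['quart_pop'].value_counts()
print counts / float(counts.sum())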
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = top_songs.boxplot(features[i],by='quart_pop',ax=plt.subplot(6,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Feature Boxplots by Quartile',size=20)
The boxplots show that songs with higher levels of energy, loudness, and danceability tend to achieve higher popularity. Listeners also appear to prefer studio-recorded songs to live performances, as indicated by the liveness metric.
As the genre area plot demonstrated, rock music has had a steady influence on American pop culture for the past half century. Yet from the Beatles to Metallica to Bruno Mars, the sound of rock is ever changing; those sound attributes may be strong enough to predict the decade a song comes from.
song_genre['log_speech'] = np.log(song_genre['speechiness'])
song_genre['log_speech'] = song_genre['log_speech'] - song_genre['log_speech'].min()
song_genre['log_instrument'] = np.log(song_genre['instrumentalness'])
song_genre['log_instrument'] = song_genre['log_instrument'] - song_genre['log_instrument'].min()
rock = song_genre[song_genre['genre']=='rock']
def decade(date):
    # '1987-05-09' -> 1980: subtract the year's final digit from the year.
    year = str(date).split('-')[0]
    last_num = year[3]
    return int(year) - int(last_num)
rock['decade'] = rock['date'].apply(decade)
rock = rock.drop(['rank','date'],axis=1)
rock = rock.drop_duplicates()
rock = rock.reset_index(drop=True)
rock.head()
 | artist_uri | genre | uri | title | artist | popularity | danceability | duration | energy | instrumentalness | key | liveness | loudness | speechiness | tempo | time_signature | log_speech | log_instrument | decade
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | spotify:artist:5zNOI87gG4RttFmYAZWaxQ | rock | 3hvakqVpwaz4L7zN5HfTCY | Why | Frankie Avalon | 0.27 | 0.422345 | 155.23955 | 0.409041 | 5.260000e-09 | 5 | 0.112646 | -8.540 | 0.026997 | 94.986 | 4 | 0.189548 | 14.067097 | 1960 |
1 | spotify:artist:5zNOI87gG4RttFmYAZWaxQ | rock | 3nCheE8HdpEvdq4YL85Zsq | Swingin` On A Rainbow | Frankie Avalon | 0.05 | 0.587611 | 112.96730 | 0.430452 | 1.480000e-09 | 0 | 0.167254 | -9.372 | 0.031134 | 76.617 | 4 | 0.332105 | 12.799008 | 1960 |
2 | spotify:artist:5zNOI87gG4RttFmYAZWaxQ | rock | 2tUkg75XKOLolMTSCTiJdc | Don't Throw Away All Those Teardrops | Frankie Avalon | 0.07 | 0.550525 | 147.90757 | 0.249479 | 5.190464e-03 | 7 | 0.375358 | -15.312 | 0.028428 | 82.363 | 3 | 0.241187 | 27.869299 | 1960 |
3 | spotify:artist:5zNOI87gG4RttFmYAZWaxQ | rock | 4BbswtciArUND4s8zbumAq | Where Are You | Frankie Avalon | 0.00 | 0.247867 | 161.50617 | 0.240684 | 7.153127e-01 | 1 | 0.233915 | -14.473 | 0.030635 | 71.348 | 4 | 0.315975 | 32.795196 | 1960 |
4 | spotify:artist:5zNOI87gG4RttFmYAZWaxQ | rock | 7EPj2koNvl0TlCo7EaqmSe | Togetherness | Frankie Avalon | 0.00 | 0.562863 | 149.86889 | 0.385338 | 2.730000e-07 | 0 | 0.187181 | -11.562 | 0.028235 | 119.011 | 4 | 0.234381 | 18.016437 | 1960 |
features = ['danceability','duration','energy','key','liveness','loudness','tempo','time_signature','log_instrument','log_speech']
rock = rock.dropna()
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = plt.subplot(5,2,v)
    ax1.hist(rock[features[i]])
    ax1.set_title(str(features[i]),fontsize=15)
The features appear to be mostly normally distributed, including verbosity and instrumentalness, which were log-transformed.
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = rock.boxplot(features[i],by='decade',ax=plt.subplot(5,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Rock Feature Boxplots by Decade',size=20)
rock = rock.dropna()
rock['loudness'] = rock['loudness']-rock['loudness'].min()
X = rock[features]
Y = rock['decade']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y)
plt.hist(ytrain)
plt.xlabel('Song Decade', size=12)
plt.suptitle('Distribution of Songs in the Training Set')
Upon examination of the training set, the target values are skewed toward the earlier decades in the dataset. Given the high variation in the data, a Naïve Bayes model, although a strong algorithm, will not pick up on the nuances in the song metadata well enough to predict a song's decade of origin, because Naive Bayes assumes that all features are independent of one another.
In its place, I will use a Random Forest model. Although prone to overfitting the training set, a Random Forest is flexible enough to pick up on the subtle variations in the data, as it does not treat each feature as if it exists in a vacuum the way the Bayes algorithm does. Moreover, as the rock song dataset contains only 5,056 unique songs, scaling to hundreds of decision trees should not be an issue.
from sklearn.ensemble import RandomForestClassifier
The random forest model is fit with 800 decision trees to increase precision.
rf = RandomForestClassifier(n_estimators=800).fit(xtrain,ytrain)
print 'Train: ', rf.score(xtrain,ytrain)
print 'Test: ', rf.score(xtest, ytest)
Train:  0.983386075949
Test:  0.537974683544
As expected, the model overfit the training set, as indicated by the 98% training score. While 54% on the test set is not a noteworthy score in absolute terms, compared to the 1-in-6 chance (about 17%) of picking the decade at random, the Random Forest picked up on patterns in the metadata features and correctly predicted over half of the decade values.
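As a supplementary look, not part of the original write-up, the fitted forest's feature importances hint at which attributes drove the decade splits:
# Rank the metadata attributes by how heavily the forest relied on them.
importances = pd.DataFrame(zip(features, rf.feature_importances_),
                           columns=['Variable','Importance'])
print importances.sort('Importance', ascending=False)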
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
y_pred = rf.predict(xtest)
decades = ['1960','1970','1980','1990','2000','2010']
matrix = pd.DataFrame(confusion_matrix(ytest,y_pred),columns=decades,index=decades)
print 'Confusion Matrix'
print
print matrix, '\n'
print 'Classification Report'
print
print classification_report(ytest,y_pred,target_names=decades),"\n"
Confusion Matrix

      1960  1970  1980  1990  2000  2010
1960   249    52     7     2     2     0
1970    68   142    86    11    10     2
1980    17    65   217    19    17     0
1990     2    27    73    13    14     1
2000     2     8    35     4    40    10
2010     5     7    14     5    19    19

Classification Report

             precision    recall  f1-score   support

       1960       0.73      0.80      0.76       312
       1970       0.47      0.45      0.46       319
       1980       0.50      0.65      0.57       335
       1990       0.24      0.10      0.14       130
       2000       0.39      0.40      0.40        99
       2010       0.59      0.28      0.38        69

avg / total       0.52      0.54      0.52      1264
According to the confusion matrix and the classification report, the model performed best on songs from the 1960s (249 of 312 correct, a recall of 0.80) and worst on songs from the 1990s, which it most often mistook for 1980s rock.
NOTE: In reading a confusion matrix, the row values represent the actual target classification values and the columns represent what the computer predicted. For example, in row 1, column 2 of the matrix, the computer predicted that 52 rock songs belonged to the 1970s when, in fact, they originated in the 1960s.
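Dividing each row of the matrix by its row total converts the counts into fractions, making the diagonal read directly as per-decade recall (a supplementary view, not in the original notebook):
# Each row now sums to 1; the diagonal entries are the per-decade recalls.
print matrix.div(matrix.sum(axis=1), axis=0)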
While a rock song's metadata attributes could only marginally predict the song's decade, the metadata may do a better job of predicting a song's genre. As different genres exhibit distinct feature levels in their metadata, the model should be able to predict genre more accurately from a training sample of songs.
genre = ['r&b','hip hop','country']
genre_subset = song_genre[song_genre['genre'].isin(genre)]
genre_subset = genre_subset.drop(['date','rank'],axis=1)
genre_subset = genre_subset.drop_duplicates()
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = genre_subset.boxplot(features[i],by='genre',ax=plt.subplot(5,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Feature Boxplots by Genre',size=20)
Not surprisingly, hip hop music is more danceable, energetic, and verbose than country or R&B. Country music, on the other hand, is slightly more instrumental. These differences in sound attributes should allow for more effective prediction of genres by the computer.
def genre_num(lst):
    # Encode each genre as its index in the genre list: r&b=0, hip hop=1, country=2.
    return genre.index(lst)
genre_subset['genre_num'] = genre_subset['genre'].apply(genre_num)
genre_subset = genre_subset.dropna()
X = genre_subset[features]
Y = genre_subset['genre_num']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y)
Despite overfitting to the training set in the rock-decade classification, the Random Forest Classifier still predicted over half of the test values correctly. Because of that predictive power amid all of the feature variation, the Random Forest Classifier is the model of choice for this task as well.
rf = RandomForestClassifier(n_estimators=800).fit(xtrain,ytrain)
print 'Train: ', rf.score(xtrain,ytrain)
print 'Test: ', rf.score(xtest, ytest)
Train:  1.0
Test:  0.697478991597
The classifier once again overfits to the training set, achieving a perfect score on the learning data. The model, however, appears to be a good fit for this data with a score of about 70%. The high variation between the selected genres’ metadata, as well as the power of the Random Forest algorithm, has led to a fairly accurate predictive model for music genres.
y_pred = rf.predict(xtest)
matrix = pd.DataFrame(confusion_matrix(ytest,y_pred),columns=genre,index=genre)
print 'Confusion Matrix'
print
print matrix, '\n'
print 'Classification Report'
print
print classification_report(ytest,y_pred,target_names=genre),"\n"
Confusion Matrix

         r&b  hip hop  country
r&b      200       32       99
hip hop   35       90        2
country   46        2      208

Classification Report

             precision    recall  f1-score   support

        r&b       0.71      0.60      0.65       331
    hip hop       0.73      0.71      0.72       127
    country       0.67      0.81      0.74       256

avg / total       0.70      0.70      0.69       714
According to the classification report, the model performed best on the country genre, with a recall of 81%. In contrast, it predicted that 35 hip hop songs belonged to the R&B genre, holding hip hop's recall to 71%.
As the computer was able to predict music genres through song metadata, it may be interesting to see how the computer builds its own genres with that same set of data.
The K-Means algorithm identifies distinct clusters in data by grouping observations with similar feature attributes. Feature selection for this model will be important, as K-Means will exploit large variances in a particular feature to subdivide the data.
Given the proper feature selection, K-Means may serve as a strong base algorithm for a music recommendation engine.
genre = ['r&b','hip hop','country','rock']
features = ['danceability','energy','liveness','time_signature','log_speech']
genre_subset = song_genre[song_genre['genre'].isin(genre)]
genre_subset = genre_subset.drop(['date','rank'],axis=1)
genre_subset = genre_subset.drop_duplicates()
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = genre_subset.boxplot(features[i],by='genre',ax=plt.subplot(5,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Feature Boxplots by Genre',size=20)
for i in genre:
    cluster_genre = genre_subset[genre_subset['genre']==i]
    print 'Top 10 Songs for Genre: %s' % i
    print
    print cluster_genre[['title','artist','popularity','genre']].sort('popularity',ascending=False).head(10)
    print
Top 10 Songs for Genre: r&b

                            title         artist  popularity genre
98244                   All of Me    John Legend        0.96   r&b
109074                  Bang Bang       Jessie J        0.91   r&b
107679  Wiggle (feat. Snoop Dogg)   Jason Derulo        0.90   r&b
111827     Me And My Broken Heart         Rixton        0.88   r&b
95037               Drunk in Love        Beyoncé        0.88   r&b
111037                 Break Free  Ariana Grande        0.87   r&b
111017                    Problem  Ariana Grande        0.86   r&b
106012             Don't Tell 'Em        Jeremih        0.85   r&b
111970                       2 On        Tinashe        0.82   r&b
100347                  New Flame    Chris Brown        0.82   r&b

Top 10 Songs for Genre: hip hop

                                   title                   artist  popularity    genre
111810                           Classic                     MKTO        0.92  hip hop
111660                Turn Down for What                 DJ Snake        0.91  hip hop
111802                       Black Widow              Iggy Azalea        0.90  hip hop
111792                             Fancy              Iggy Azalea        0.89  hip hop
110791  Can't Hold Us - feat. Ray Dalton  Macklemore & Ryan Lewis        0.89  hip hop
95283                              Happy        Pharrell Williams        0.87  hip hop
96674                             Timber                  Pitbull        0.86  hip hop
95318                    Come Get It Bae        Pharrell Williams        0.86  hip hop
111779                             Fancy              Iggy Azalea        0.84  hip hop
96708                     Wild Wild Love                  Pitbull        0.84  hip hop

Top 10 Songs for Genre: country

                      title                artist  popularity    genre
101556      Burnin' It Down          Jason Aldean        0.85  country
110642                 Dirt  Florida Georgia Line        0.84  country
85629         American Kids         Kenny Chesney        0.83  country
110624  This Is How We Roll  Florida Georgia Line        0.81  country
107489        Play It Again            Luke Bryan        0.80  country
95474      Drunk On A Plane        Dierks Bentley        0.80  country
109348             Beachin'             Jake Owen        0.79  country
106214            Bartender       Lady Antebellum        0.78  country
110572               Cruise  Florida Georgia Line        0.77  country
110516        Where It's At          Dustin Lynch        0.76  country

Top 10 Songs for Genre: rock

                      title      artist  popularity genre
94681   A Sky Full Of Stars    Coldplay        0.96  rock
111899                 Rude      Magic!        0.94  rock
94691                 Magic    Coldplay        0.91  rock
105275    This Is How We Do  Katy Perry        0.90  rock
105209           Dark Horse  Katy Perry        0.89  rock
103259                 Maps    Maroon 5        0.87  rock
103708         Ain't It Fun    Paramore        0.87  rock
94655              Paradise    Coldplay        0.84  rock
108576  When I Was Your Man  Bruno Mars        0.84  rock
108604             Treasure  Bruno Mars        0.83  rock
from sklearn.cluster import KMeans
features = ['danceability','energy','liveness','time_signature','log_speech']
genre_subset = genre_subset.dropna()
X = genre_subset[features]
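One caution before fitting, noted here as a supplementary aside rather than a step the original analysis took: K-Means measures plain Euclidean distance, so a feature with a large numeric range can dominate the clusters. A common remedy would be to standardize each column first:
from sklearn.preprocessing import StandardScaler
# Optional preprocessing (not applied below): scale each feature to zero mean
# and unit variance so no single attribute dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X.values)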
As I put 4 genres of songs into the model, I have programmed it to return 4 new genres for an equal comparison.
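A sanity check on that choice of k, again supplementary, is an elbow plot: fit K-Means over a range of cluster counts and look for where the within-cluster inertia stops falling sharply:
# Within-cluster sum of squares for k = 1..10; the 'elbow' suggests a
# reasonable number of clusters.
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Within-cluster sum of squares')
plt.suptitle('K-Means Elbow Plot')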
km = KMeans(n_clusters=len(genre)).fit(X)
genre_subset['prediction'] = km.predict(genre_subset[features])
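To make the earlier recommendation-engine idea concrete, here is a minimal hypothetical sketch; the recommend helper is mine, not part of the original project. It looks up a seed song's cluster and returns the most popular other songs assigned to it:
def recommend(title, n=5):
    # Hypothetical helper: return the n most popular songs that K-Means
    # placed in the same cluster as the seed track.
    seed = genre_subset[genre_subset['title'] == title].iloc[0]
    same = genre_subset[(genre_subset['prediction'] == seed['prediction']) &
                        (genre_subset['title'] != title)]
    return same.sort('popularity', ascending=False)[['title','artist','genre']].head(n)

print recommend('All of Me')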
plt.figure(figsize=(20,20))
for i in range(len(features)):
    v = i + 1
    ax1 = genre_subset.boxplot(features[i],by='prediction',ax=plt.subplot(5,2,v))
    ax1.set_title(str(features[i]),fontsize=15)
    ax1.set_xlabel('')
plt.suptitle('Feature Boxplots by Clustered Genre',size=20)
According to the boxplots, the K-Means model clustered the data largely by verbosity, as that attribute showed the most variation within this particular feature set. To a lesser extent, the model also divided the songs by danceability and energy.
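The fitted cluster centers make this concrete; each row below is one learned genre's centroid in the five-dimensional feature space (a supplementary inspection):
# One row per learned cluster; columns are the mean feature values.
print pd.DataFrame(km.cluster_centers_, columns=features)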
from IPython.display import HTML
genre_dict = {}
for i in np.sort(genre_subset['prediction'].unique()):
    cluster_genre = genre_subset[genre_subset['prediction']==i]
    cluster_genre = cluster_genre.sort('popularity',ascending=False).reset_index()
    print 'Top 10 Songs for Cluster Genre %s' % i
    print
    print cluster_genre[['title','artist','popularity','genre']].head(10)
    print
    # Build an embeddable Spotify playlist from the cluster's 20 most popular tracks.
    genre_dict[i] = list(cluster_genre['uri'][0:20])
    genre_dict[i] = ','.join(genre_dict[i])
    genre_dict[i] = '<iframe src="https://embed.spotify.com/?uri=spotify:trackset:Genre %s Top 20:%s" frameborder="0" allowtransparency="true"></iframe>' % (i, genre_dict[i])
Top 10 Songs for Cluster Genre 0

                                        title                   artist  popularity    genre
0                          Turn Down for What                 DJ Snake        0.91  hip hop
1                                       Happy        Pharrell Williams        0.87  hip hop
2                                     Problem            Ariana Grande        0.86      r&b
3                    Thrift Shop - feat. Wanz  Macklemore & Ryan Lewis        0.83  hip hop
4                                     Show Me                  Kid Ink        0.80  hip hop
5  Empire State Of Mind [Jay-Z + Alicia Keys]                    JAY Z        0.80  hip hop
6                               Feel Good Inc                 Gorillaz        0.79     rock
7                               Crazy in Love                  Beyoncé        0.78      r&b
8                                 Ms. Jackson                  OutKast        0.78  hip hop
9                                No Flex Zone             Rae Sremmurd        0.78  hip hop

Top 10 Songs for Cluster Genre 1

                    title         artist  popularity genre
0               All of Me    John Legend        0.96   r&b
1     A Sky Full Of Stars       Coldplay        0.96  rock
2                    Rude         Magic!        0.94  rock
3                   Magic       Coldplay        0.91  rock
4  Me And My Broken Heart         Rixton        0.88   r&b
5           Drunk in Love        Beyoncé        0.88   r&b
6                    Maps       Maroon 5        0.87  rock
7              Break Free  Ariana Grande        0.87   r&b
8                Paradise       Coldplay        0.84  rock
9     When I Was Your Man     Bruno Mars        0.84  rock

Top 10 Songs for Cluster Genre 2

                              title                   artist  popularity    genre
0                           Classic                     MKTO        0.92  hip hop
1                         Bang Bang                 Jessie J        0.91      r&b
2                 This Is How We Do               Katy Perry        0.90     rock
3                       Black Widow              Iggy Azalea        0.90  hip hop
4  Can't Hold Us - feat. Ray Dalton  Macklemore & Ryan Lewis        0.89  hip hop
5                             Fancy              Iggy Azalea        0.89  hip hop
6                        Dark Horse               Katy Perry        0.89     rock
7                      Ain't It Fun                 Paramore        0.87     rock
8                            Timber                  Pitbull        0.86  hip hop
9                   Come Get It Bae        Pharrell Williams        0.86  hip hop

Top 10 Songs for Cluster Genre 3

                title             artist  popularity genre
0     I Won't Give Up         Jason Mraz        0.81  rock
1                Iris  The Goo Goo Dolls        0.79  rock
2            What Now            Rihanna        0.74   r&b
3           Piano Man         Billy Joel        0.74  rock
4  If I Ain't Got You        Alicia Keys        0.73   r&b
5       Hold the Line               Toto        0.73  rock
6    A Thousand Years    Christina Perri        0.72  rock
7  The Only Exception           Paramore        0.72  rock
8     Everybody Hurts             R.E.M.        0.70  rock
9       Fade Into You         Mazzy Star        0.70  rock
HTML(genre_dict[0])
As the boxplots above indicate, Genre 0 songs are high in energy, medium in danceability and verbosity.
HTML(genre_dict[1])
Genre 1 represents songs with medium levels of energy and lower levels of verbosity.
HTML(genre_dict[2])
Genre 2 songs are the most energetic and verbose of the 4 clustered genres. Unsurprisingly, many of these songs originally come from the Hip Hop genre.
HTML(genre_dict[3])
Songs in Genre 3 are the least energetic, verbose, and danceable of the 4 clustered genres. This genre contains many rock and R&B ballads.
Using song metadata from the past half century of the Billboard Top 40, together with OLS, logistic regression, Random Forest, and K-Means machine learning algorithms, I determined that in an OLS linear model, time plays a larger role in a song's popularity attrition than any metadata factor. Using a categorical logistic regression model, however, I found that loudness, energy, and danceability account, to some extent, for some of the gains in song popularity.
Moreover, depending on the strength of the algorithm used, song metadata can be a decent predictor of the decade a particular song comes from. The metadata truly shines, however, in the classification and clustering of genres. In developing my own genre clusters with K-Means, I was amazed at how accurately the computer grouped like-sounding songs using only 5 feature attributes.
As the genre area plot demonstrated, we as a culture are getting more diverse in terms of the music to which we listen. With the power of machine learning, however, services like Spotify, Google Music, and Pandora will never be at a loss for songs to meet a particular individual’s tastes, despite the ever increasing scope of possibilities.
As the metadata proved to be a powerful tool for clustering songs, it may be possible to develop a refined music recommendation algorithm based on K-Means clustering. While my model worked with this particular dataset, I will need to include a wider variety of songs and continuously test features to make the music classifier more robust.
As Echonest updates its API with more song metadata, including valence (song positivity), it may be interesting to revisit my models to determine the impact of these new features. Moreover, Echonest could potentially provide additional metadata, such as whether a song has explicit lyrics, to make the recommendation engine better aligned with customer tastes.
For more information on this dataset and the services that provided it, please visit:
For more code and the datasets for this project, please visit my GitHub page: