Loving you is complicated: Quantifying musical sentiment and success through data science¶

Spotify has quickly become the most popular music streaming service in the world, with over 271 million active users every month. A close collaborator of Spotify is Genius, a website with 26.5 million monthly users that lets members of its community upload, annotate, and interpret lyrics from music artists. Inspired by the blog of Thompson Analytics, I will use the Spotify API, the Genius API, and the NRC Emotion Lexicon to quantify both the musical and the lyrical sentiment of one of the most prominent artists of our time: Kendrick Lamar. A visualization of his discography will accompany this process of data collection. In addition, I will collect data on the 100 most-streamed songs of all time and use different regression techniques to determine how useful these features are for predicting musical success.

PART I: Collecting Kendrick Lamar's Discography.¶

Spotify features¶

The Spotify API assigns musical features to every track on their platform. These features are defined in the following way:

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
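
To make the thresholds in the speechiness definition concrete, here is a minimal sketch that maps a score to the three bands quoted above. The function name and the band labels are my own illustrative choices; they are not part of the Spotify API.

```python
def describe_speechiness(value):
    """Map a speechiness score to the bands quoted in the definition above."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness is expected to lie between 0.0 and 1.0")
    if value > 0.66:
        return "mostly spoken word"   # probably made entirely of spoken words
    if value >= 0.33:
        return "music and speech"     # e.g. rap, in sections or layered
    return "mostly music"             # music and other non-speech-like tracks


print(describe_speechiness(0.357))
```

A score of 0.357, for example, falls in the middle band, which is where the definition places rap music.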

I will be using these features for visualization and analytical purposes in the first and second part respectively.

Getting the Data from Spotify¶

In order to get the data from Spotify I will be using Spotipy, a lightweight Python library for the Spotify Web API:

In [1]:

pip install spotipy --upgrade

Requirement already up-to-date: spotipy in /opt/conda/lib/python3.7/site-packages (2.11.2)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (1.14.0)
Requirement already satisfied, skipping upgrade: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (2.23.0)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2019.11.28)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (1.25.7)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2.9)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (3.0.4)
Note: you may need to restart the kernel to use updated packages.

I will be using Plotly for most visualization tasks since I love how crisp it looks. For the second part of this project I will rely on Scikit-learn to fit the different prediction models. In addition, I need to set the credentials obtained from my Spotify API application to access their data.

In [2]:

# import libraries (duplicate imports removed)
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
from datetime import date
import matplotlib.pyplot as plt
from math import pi
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import numpy as np
from sklearn import (linear_model, metrics, neural_network, pipeline, model_selection, preprocessing)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from scipy import stats

# authenticate and connect to the Spotify API
client_id = '547191f3201147df8e76a2aa96607aa3'
client_secret = '43cb18993cf04a53866a3690d1796600'
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
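
As a side note, credentials like the ones above do not have to be written into the notebook. The following is a minimal sketch, not the approach used in this project, that reads them from environment variables instead. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` follow the convention I understand Spotipy itself to use, so treat them as an assumption and adjust if your setup differs.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        # report which variable is absent without echoing any secret value
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None


# Hypothetical usage with the objects defined in the cell above:
# client_id, client_secret = load_spotify_credentials()
# client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.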

Next, I define a series of functions that will allow me to gather data from the Spotify API and put it into a dataframe:

In [3]:

# Defining functions

def get_artist(name):
    # Returns the artist object with respective data
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None

def get_artist_albums_ids(artist):
    # Given an artist object, returns a list with the id of each of the artist's albums
    # The API is queried once and the response is reused for every album
    albums = sp.artist_albums(artist['id'], album_type='album', limit=50)['items']
    return [album['id'] for album in albums]

def filter_albums_ids(albums_ids):
    # Spotify has many versions of the same album.
    # This function filters a list of album ids and keeps the most popular version of each album name
    album_names = []
    album_pop = []
    for album_id in albums_ids:
        album = sp.album(album_id)  # one request per album, reused for both fields
        album_names.append(album['name'])
        album_pop.append(album['popularity'])
    d = {'id': albums_ids, 'name': album_names, 'popularity': album_pop}
    df = pd.DataFrame(data=d)
    # sort by popularity so drop_duplicates keeps the most popular version, then restore the original order
    df = df.sort_values('popularity', ascending=False).drop_duplicates('name').sort_index()
    df = df.reset_index(drop=True)
    return df['id']

def get_artist_albums_tracks_ids(albums_ids):
    # Given a list of album ids, returns a list with the ids of the albums' tracks
    tracks_ids = []
    for album_id in albums_ids:
        # one request per album; the response is reused for all of its tracks
        for item in sp.album_tracks(album_id)['items']:
            tracks_ids.append(item['id'])
    return tracks_ids

def get_track_features(id):
    # Given a track id, returns a flat list with its metadata and the musical features described in the introduction
    meta = sp.track(id)
    features = sp.audio_features(id)

    # Meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # Features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    valence = features[0]['valence']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']

    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, valence, tempo, time_signature]
    return track

def get_tracks_features(tracks_ids):
    # given a list of ids of tracks, returns a list with their features
    tracks = []
    for i in range(0, len(tracks_ids)):
        track = get_track_features(tracks_ids[i])
        tracks.append(track)
    return(tracks)

def tracks_features_to_csv(tracks_features, csv_title):
    # transforms the list of track features into a csv file
    df = pd.DataFrame(tracks_features, columns = ['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'valence','tempo', 'time_signature'])
    df.to_csv(csv_title + ".csv", sep = ',')
    
def get_discography(artist_name):
    # a function that returns the discography of an artist given its name (string)
    # do not use, it is quite unreliable since the discography is not filtered
    tracks_features = get_tracks_features(get_artist_albums_tracks_ids(filter_albums_ids(get_artist_albums_ids(get_artist(artist_name)))))
    return tracks_features_to_csv(tracks_features, artist_name)

def get_playlist_track_ids(user, playlist_id):
    # initially I wanted to use a playlist to get an artist's discography
    # however, finding a good playlist proved to be quite difficult
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids
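
The idea behind `filter_albums_ids` (keep only the most popular version of each album name, without disturbing the original order) can be restated without pandas. The sketch below is only an illustration of that logic: the function name is hypothetical, and the ids and popularity scores in the example are made up.

```python
def keep_most_popular(albums):
    """albums: list of dicts with 'id', 'name' and 'popularity' keys."""
    best = {}
    for album in albums:
        current = best.get(album["name"])
        # the first album seen wins ties, later ones must be strictly more popular
        if current is None or album["popularity"] > current["popularity"]:
            best[album["name"]] = album
    kept_ids = {album["id"] for album in best.values()}
    # walk the input again so the surviving ids keep their original order
    return [album["id"] for album in albums if album["id"] in kept_ids]


toy_albums = [
    {"id": "a1", "name": "DAMN.", "popularity": 80},
    {"id": "a2", "name": "DAMN.", "popularity": 60},
    {"id": "b1", "name": "Section.80", "popularity": 70},
]
print(keep_most_popular(toy_albums))
```

Note that this only merges albums whose names match exactly, which is also true of the pandas version above.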

Done! Let's see the functions in action.

In [4]:

# first, we find the artist object
kendrick = get_artist("Kendrick Lamar")
# we look for his albums
albums_ids = get_artist_albums_ids(kendrick)
# we filter this list since spotify has many versions of the same album. We keep the most popular one
filtered_albums_ids = filter_albums_ids(albums_ids)
# we get the id of every track in his discography
tracks_ids = get_artist_albums_tracks_ids(filtered_albums_ids)
# we get the features of every track
tracks_features = get_tracks_features(tracks_ids)
# we export the data to csv format
tracks_features_to_csv(tracks_features, "Kendrick Lamar")

The API tends to be quite unreliable. At times the process yields an error, so the previous step might require more than one attempt. In any case, I got the data on the second try. Let's see how it looks:
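
Rather than re-running the cell by hand, the retry can be automated. What follows is a minimal sketch of a generic retry helper with an increasing wait between attempts; the name `with_retries` and its parameters are hypothetical and not part of Spotipy. It simply re-calls whatever function it is given when that function raises an exception.

```python
import time


def with_retries(func, *args, attempts=3, wait_seconds=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying up to `attempts` times on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                # wait 1x, 2x, 4x, ... the base delay before trying again
                time.sleep(wait_seconds * (2 ** attempt))
    raise last_error


# Hypothetical usage with the function defined earlier:
# tracks_features = with_retries(get_tracks_features, tracks_ids)
```

Catching every exception is a deliberately blunt choice for a notebook; in a longer-lived script it would be better to catch only the errors that are known to be transient.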

In [5]:

df_kendrick = pd.read_csv("Kendrick Lamar.csv",index_col=[0])
print(df_kendrick['album'].unique())
df_kendrick

['Black Panther The Album Music From And Inspired By'
 'DAMN. COLLECTORS EDITION.' 'DAMN.' 'untitled unmastered.'
 'To Pimp A Butterfly' 'good kid, m.A.A.d city (Deluxe)'
 'good kid, m.A.A.d city' 'Section.80' 'Overly Dedicated']

Out[5]:

	name	album	artist	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	time_signature
0	Black Panther	Black Panther The Album Music From And Inspire...	Kendrick Lamar	2018-02-09	130613	58	0.618	0.6250	0.582	0.000004	0.2650	-9.454	0.2970	0.480	90.035	4
1	All The Stars (with SZA)	Black Panther The Album Music From And Inspire...	Kendrick Lamar	2018-02-09	232186	79	0.698	0.0605	0.633	0.000194	0.0926	-4.946	0.0597	0.552	96.924	4
2	X (with 2 Chainz & Saudi)	Black Panther The Album Music From And Inspire...	Kendrick Lamar	2018-02-09	267426	70	0.768	0.0201	0.471	0.000000	0.2680	-8.406	0.2590	0.405	131.023	4
3	The Ways (with Swae Lee)	Black Panther The Album Music From And Inspire...	Kendrick Lamar	2018-02-09	238893	66	0.727	0.0626	0.720	0.000001	0.1760	-5.856	0.0488	0.589	140.080	4
4	Opps (with Yugen Blakrok)	Black Panther The Album Music From And Inspire...	Kendrick Lamar	2018-02-09	180893	60	0.706	0.1520	0.775	0.000033	0.4160	-6.819	0.3350	0.847	127.929	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
118	Barbed Wire	Overly Dedicated	Kendrick Lamar	2010-09-14	265678	44	0.613	0.0152	0.843	0.000000	0.0917	-8.343	0.1860	0.325	102.968	4
119	Average Joe	Overly Dedicated	Kendrick Lamar	2010-09-14	256048	46	0.740	0.2430	0.733	0.000000	0.1840	-3.343	0.2480	0.218	91.603	4
120	H.O.C	Overly Dedicated	Kendrick Lamar	2010-09-14	316975	44	0.613	0.1050	0.591	0.000000	0.1910	-8.580	0.3910	0.371	77.124	4
121	Cut You Off (To Grow Closer)	Overly Dedicated	Kendrick Lamar	2010-09-14	364103	47	0.685	0.0770	0.681	0.000000	0.1290	-7.176	0.4810	0.614	82.982	4
122	She Needs Me (Remix)	Overly Dedicated	Kendrick Lamar	2010-09-14	195790	53	0.606	0.4170	0.835	0.000001	0.1030	-6.107	0.2650	0.313	100.145	4

123 rows × 16 columns

Looking good. However, notice that we still have some repeated observations because of the deluxe edition of good kid, m.A.A.d city and the collector's edition of DAMN. Going by popularity, I will keep the deluxe edition of good kid, m.A.A.d city (dropping the standard one) and drop the collector's edition of DAMN. Moreover, I will drop the Black Panther soundtrack, since I believe it does not stand as a singular effort by Kendrick but rather as a collective effort meant to accompany the movie.

In [6]:

df_kendrick = df_kendrick[df_kendrick['album'] != 'good kid, m.A.A.d city']
df_kendrick = df_kendrick[df_kendrick['album'] != 'Black Panther The Album Music From And Inspired By']
df_kendrick = df_kendrick[df_kendrick['album'] != 'DAMN. COLLECTORS EDITION.']
df_kendrick = df_kendrick.reset_index(drop=True)
df_kendrick.head()

Out[6]:

	name	album	artist	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	time_signature
0	BLOOD.	DAMN.	Kendrick Lamar	2017-04-14	118066	61	0.357	0.14200	0.238	0.085900	0.5500	-16.780	0.265	0.494	156.907	4
1	DNA.	DAMN.	Kendrick Lamar	2017-04-14	185946	78	0.638	0.00454	0.523	0.000000	0.0842	-6.664	0.357	0.422	139.913	4
2	YAH.	DAMN.	Kendrick Lamar	2017-04-14	160293	64	0.670	0.57600	0.700	0.000005	0.2260	-7.893	0.196	0.648	69.986	4
3	ELEMENT.	DAMN.	Kendrick Lamar	2017-04-14	208733	71	0.748	0.20400	0.705	0.000000	0.2460	-4.547	0.485	0.483	189.891	4
4	FEEL.	DAMN.	Kendrick Lamar	2017-04-14	214826	64	0.746	0.13700	0.798	0.000000	0.1390	-8.382	0.349	0.553	109.968	4

Getting the data from Genius¶

I will be using the LyricsGenius package that provides a simple interface to the song, artist, and lyrics data stored on Genius.

In [7]:

pip install git+https://github.com/johnwmillr/LyricsGenius.git

Collecting git+https://github.com/johnwmillr/LyricsGenius.git
  Cloning https://github.com/johnwmillr/LyricsGenius.git to /tmp/pip-req-build-_w3rl_eu
  Running command git clone -q https://github.com/johnwmillr/LyricsGenius.git /tmp/pip-req-build-_w3rl_eu
Requirement already satisfied (use --upgrade to upgrade): lyricsgenius==1.8.2 from git+https://github.com/johnwmillr/LyricsGenius.git in /opt/conda/lib/python3.7/site-packages
Requirement already satisfied: beautifulsoup4==4.6.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (4.6.0)
Requirement already satisfied: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (2.23.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (1.25.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2019.11.28)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2.9)
Building wheels for collected packages: lyricsgenius
  Building wheel for lyricsgenius (setup.py) ... done
  Created wheel for lyricsgenius: filename=lyricsgenius-1.8.2-py3-none-any.whl size=15038 sha256=8cd1776e3a526e58217a9324415b1e95849fa37880a194dc8f2d22f4fd7ab40f
  Stored in directory: /tmp/pip-ephem-wheel-cache-60w3__2c/wheels/12/d5/2b/6b771ebb067bceb8816ec5eef0dd0d36bf069b18f03ac8ca20
Successfully built lyricsgenius
Note: you may need to restart the kernel to use updated packages.

Like before, we set the credentials obtained from my Genius app.

In [8]:

import lyricsgenius

# Genius API
genius = lyricsgenius.Genius("BCf3jrIhyBKfxJSB7t8ELCDvW9ASizVNtK8VCaka8TuM3Igd3Tma5eDumwizSicV")
genius.remove_section_headers = True

My plan is to iterate through my dataframe, look up every song name, and build a list with the lyrics. I decided against applying a function to the dataframe since the Genius API is very unreliable (it likes to throw errors at random: at times it works, at times it just does not!). Although this approach is lengthier, it is much more reliable. Before doing any of this, I need to standardize some names between my dataframe and the Genius API (titles that include the word "featuring" are problematic). Please forgive the profanity in the following cell.

In [9]:

df_kendrick.loc[df_kendrick['name'] ==  "Swimming Pools (Drank) - Extended Version", 'name'] = "Swimming Pools (Drank)"
df_kendrick.loc[df_kendrick['name'] ==  "F*ck Your Ethnicity", 'name'] = "Fuck Your Ethnicity"
df_kendrick.loc[df_kendrick['name'] ==  "LOYALTY. FEAT. RIHANNA.", 'name'] = "LOYALTY."
df_kendrick.loc[df_kendrick['name'] ==  "LOVE. FEAT. ZACARI.", 'name'] = "LOVE."
df_kendrick.loc[df_kendrick['name'] ==  "XXX. FEAT. U2.", 'name'] = "XXX."

We are good to go! Let's see how the loop performs.

In [10]:

lyrics = []
for name in df_kendrick['name']:
    lyrics.append(genius.search_song(name, "Kendrick Lamar"))

Searching for "BLOOD." by Kendrick Lamar...
Done.
Searching for "DNA." by Kendrick Lamar...
Done.
Searching for "YAH." by Kendrick Lamar...
Done.
Searching for "ELEMENT." by Kendrick Lamar...
Done.
Searching for "FEEL." by Kendrick Lamar...
Done.
Searching for "LOYALTY." by Kendrick Lamar...
Done.
Searching for "PRIDE." by Kendrick Lamar...
Done.
Searching for "HUMBLE." by Kendrick Lamar...
Done.
Searching for "LUST." by Kendrick Lamar...
Done.
Searching for "LOVE." by Kendrick Lamar...
Done.
Searching for "XXX." by Kendrick Lamar...
Done.
Searching for "FEAR." by Kendrick Lamar...
Done.
Searching for "GOD." by Kendrick Lamar...
Done.
Searching for "DUCKWORTH." by Kendrick Lamar...
Done.
Searching for "untitled 01 | 08.19.2014." by Kendrick Lamar...
Done.
Searching for "untitled 02 | 06.23.2014." by Kendrick Lamar...
Done.
Searching for "untitled 03 | 05.28.2013." by Kendrick Lamar...
Done.
Searching for "untitled 04 | 08.14.2014." by Kendrick Lamar...
Done.
Searching for "untitled 05 | 09.21.2014." by Kendrick Lamar...
Done.
Searching for "untitled 06 | 06.30.2014." by Kendrick Lamar...
Done.
Searching for "untitled 07 | 2014 - 2016" by Kendrick Lamar...
Done.
Searching for "untitled 08 | 09.06.2014." by Kendrick Lamar...
Done.
Searching for "Wesley's Theory" by Kendrick Lamar...
Done.
Searching for "For Free? - Interlude" by Kendrick Lamar...
Done.
Searching for "King Kunta" by Kendrick Lamar...
Done.
Searching for "Institutionalized" by Kendrick Lamar...
Done.
Searching for "These Walls" by Kendrick Lamar...
Done.
Searching for "u" by Kendrick Lamar...
Done.
Searching for "Alright" by Kendrick Lamar...
Done.
Searching for "For Sale? - Interlude" by Kendrick Lamar...
Done.
Searching for "Momma" by Kendrick Lamar...
Done.
Searching for "Hood Politics" by Kendrick Lamar...
Done.
Searching for "How Much A Dollar Cost" by Kendrick Lamar...
Done.
Searching for "Complexion (A Zulu Love)" by Kendrick Lamar...
Done.
Searching for "The Blacker The Berry" by Kendrick Lamar...
Done.
Searching for "You Ain't Gotta Lie (Momma Said)" by Kendrick Lamar...
Done.
Searching for "i" by Kendrick Lamar...
Done.
Searching for "Mortal Man" by Kendrick Lamar...
Done.
Searching for "Sherane a.k.a Master Splinter’s Daughter" by Kendrick Lamar...
Done.
Searching for "Bitch, Don’t Kill My Vibe" by Kendrick Lamar...
Done.
Searching for "Backseat Freestyle" by Kendrick Lamar...
Done.
Searching for "The Art of Peer Pressure" by Kendrick Lamar...
Done.
Searching for "Money Trees" by Kendrick Lamar...
Done.
Searching for "Poetic Justice" by Kendrick Lamar...
Done.
Searching for "good kid" by Kendrick Lamar...
Done.
Searching for "m.A.A.d city" by Kendrick Lamar...
Done.
Searching for "Swimming Pools (Drank)" by Kendrick Lamar...
Done.
Searching for "Sing About Me, I'm Dying Of Thirst" by Kendrick Lamar...
Done.
Searching for "Real" by Kendrick Lamar...
Done.
Searching for "Compton" by Kendrick Lamar...
Done.
Searching for "The Recipe - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Black Boy Fly - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Now Or Never - Bonus Track" by Kendrick Lamar...
Done.
Searching for "The Recipe (Black Hippy Remix) - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Bitch, Don’t Kill My Vibe - Remix" by Kendrick Lamar...
Done.
Searching for "Fuck Your Ethnicity" by Kendrick Lamar...
Done.
Searching for "Hol' Up" by Kendrick Lamar...
Done.
Searching for "A.D.H.D" by Kendrick Lamar...
Done.
Searching for "No Make-Up (Her Vice) (feat. Colin Munroe)" by Kendrick Lamar...
Done.
Searching for "Tammy's Song (Her Evils)" by Kendrick Lamar...
Done.
Searching for "Chapter Six" by Kendrick Lamar...
Done.
Searching for "Ronald Reagan Era" by Kendrick Lamar...
Done.
Searching for "Poe Mans Dreams (His Vice) (feat. GLC)" by Kendrick Lamar...
Done.
Searching for "Chapter Ten" by Kendrick Lamar...
Done.
Searching for "Keisha's Song (Her Pain) (feat. Ashtro Bot)" by Kendrick Lamar...
Done.
Searching for "Rigamortus" by Kendrick Lamar...
Done.
Searching for "Kush & Corinthians (feat. BJ The Chicago Kid)" by Kendrick Lamar...
Done.
Searching for "Blow My High (Members Only)" by Kendrick Lamar...
Done.
Searching for "Ab-Souls Outro (feat. Ab-Soul)" by Kendrick Lamar...
Done.
Searching for "HiiiPower" by Kendrick Lamar...
Done.
Searching for "Growing Apart (To Get Closer)" by Kendrick Lamar...
Done.
Searching for "Ignorance Is Bliss" by Kendrick Lamar...
Done.
Searching for "P&P 1.5" by Kendrick Lamar...
Done.
Searching for "Alien Girl (Today W/ Her)" by Kendrick Lamar...
Done.
Searching for "Opposites Attract (Tomorrow W/O Her)" by Kendrick Lamar...
Done.
Searching for "Michael Jordan" by Kendrick Lamar...
Done.
Searching for "R.O.T.C (Interlude)" by Kendrick Lamar...
Done.
Searching for "Barbed Wire" by Kendrick Lamar...
Done.
Searching for "Average Joe" by Kendrick Lamar...
Done.
Searching for "H.O.C" by Kendrick Lamar...
Done.
Searching for "Cut You Off (To Grow Closer)" by Kendrick Lamar...
Done.
Searching for "She Needs Me (Remix)" by Kendrick Lamar...
Done.

A quick inspection of the list of lyrics reveals that we got a few None values. Let's fix those:

In [11]:

while None in lyrics:
    none_index = [i for i in range(len(lyrics)) if lyrics[i] is None]
    for i in none_index:
        lyrics[i] = genius.search_song(df_kendrick['name'][i], "Kendrick Lamar")
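
One caveat with the loop above is that it never terminates if some song simply cannot be found. A bounded variant is sketched here under that assumption: `fill_missing`, `fetch`, and `max_rounds` are hypothetical names, and `fetch` stands in for the call to `genius.search_song`.

```python
def fill_missing(results, fetch, keys, max_rounds=5):
    """Retry fetch(key) for every position that is still None, at most max_rounds times.

    Fills `results` in place and returns the indices that are still missing.
    """
    for _ in range(max_rounds):
        missing = [i for i, value in enumerate(results) if value is None]
        if not missing:
            break
        for i in missing:
            results[i] = fetch(keys[i])
    return [i for i, value in enumerate(results) if value is None]


# Hypothetical usage with the objects defined above:
# still_missing = fill_missing(
#     lyrics,
#     lambda name: genius.search_song(name, "Kendrick Lamar"),
#     list(df_kendrick['name']),
# )
```

Any index left in the returned list points to a title that probably needs to be renamed by hand, as was done for the "FEAT." titles earlier.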

We are ready to extract the lyrics and append them into the dataframe.

In [12]:

lyrics_text = []
for lyric in lyrics:
    lyrics_text.append(lyric.lyrics.lower())
    
df_kendrick['lyrics'] = lyrics_text
df_kendrick.head()

Out[12]:

	name	album	artist	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	time_signature	lyrics
0	BLOOD.	DAMN.	Kendrick Lamar	2017-04-14	118066	61	0.357	0.14200	0.238	0.085900	0.5500	-16.780	0.265	0.494	156.907	4	is it wickedness?\nis it weakness?\nyou decide...
1	DNA.	DAMN.	Kendrick Lamar	2017-04-14	185946	78	0.638	0.00454	0.523	0.000000	0.0842	-6.664	0.357	0.422	139.913	4	i got, i got, i got, i got—\nloyalty, got roya...
2	YAH.	DAMN.	Kendrick Lamar	2017-04-14	160293	64	0.670	0.57600	0.700	0.000005	0.2260	-7.893	0.196	0.648	69.986	4	new shit, new kung fu kenny\n\ni got so many t...
3	ELEMENT.	DAMN.	Kendrick Lamar	2017-04-14	208733	71	0.748	0.20400	0.705	0.000000	0.2460	-4.547	0.485	0.483	189.891	4	new kung fu kenny\nain't nobody prayin' for me...
4	FEEL.	DAMN.	Kendrick Lamar	2017-04-14	214826	64	0.746	0.13700	0.798	0.000000	0.1390	-8.382	0.349	0.553	109.968	4	ain't nobody prayin' for me\n(ain't nobody pra...

Quantifying Lyrical Sentiment.¶

In order to quantify lyrical sentiment I will use the NRC Emotion Lexicon. This dataset assigns different emotions or sentiments to English words. For example, the word abandon is mapped to the emotions of fear and sadness. My strategy is to compute, for each song, the proportion of emotion-word matches that falls under each sentiment. I recognize that this is a very limited approach to lyric analysis, but I believe it is a good starting point given the current constraints.

Let's create a function that, given a song's lyrics and an emotion, computes the number of words in the song associated with that emotion.

In [13]:

nrc = pd.read_table("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt", header=None, names=['word','emotion','dummy'])

In [14]:

def count_emotion(song_lyrics, emotion):
    nrc_emotion = nrc[nrc['emotion'] == emotion]
    nrc_emotion = nrc_emotion[nrc_emotion['dummy'] == 1]
    nrc_emotion = nrc_emotion.reset_index(drop=True)
    sum_emotion = 0
    for word in nrc_emotion['word']:
        sum_emotion = sum_emotion + song_lyrics.count(word)    
    return sum_emotion
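
One thing to keep in mind is that `str.count` counts substrings, so a short lexicon word is also counted when it appears inside a longer word. The sketch below shows an alternative that counts whole tokens instead. It is not the method used in the rest of this notebook; the function name, the tokenizing pattern, and the toy sentence are my own assumptions.

```python
import re
from collections import Counter


def count_emotion_words(song_lyrics, emotion_words):
    """Count whole-word occurrences of the given words in the lyrics."""
    # split the lyrics into lowercase tokens made of letters and apostrophes
    tokens = re.findall(r"[a-z']+", song_lyrics.lower())
    counts = Counter(tokens)
    return sum(counts[word] for word in emotion_words)


text = "my heart is in pain, pain is art"
# substring counting also matches the "art" inside "heart"
print(text.count("art"), count_emotion_words(text, {"art"}))
```

Switching to token counts would change the emotion proportions computed below, so the two approaches should not be mixed within the same analysis.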

Next, we create a function that counts the words associated with each emotion in a song and converts those counts into proportions.

In [15]:

def emotion_prop(song_lyrics):
    emotion_proportions = []
    for emotion in nrc['emotion'].unique():
        emotion_proportions.append(count_emotion(song_lyrics, emotion))
    array = np.array(emotion_proportions)
    return array/array.sum()
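
As a small worked example of the normalization step: counts of 3, 1, and 0 become proportions of 0.75, 0.25, and 0.0. The sketch below restates that step in plain Python and adds a guard for songs with no matches at all, where dividing by the sum in NumPy would produce NaN values. The function name is hypothetical.

```python
def to_proportions(counts):
    """Turn a list of emotion counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        # no lexicon word was found (e.g. an instrumental track): avoid 0/0
        return [0.0] * len(counts)
    return [count / total for count in counts]


print(to_proportions([3, 1, 0]))
```

Returning zeros for an empty song is a design choice; leaving the values as missing would be equally defensible, as long as later averages handle them consistently.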

Finally, we append these proportions for each song in the dataframe:

In [16]:

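emotions = nrc['emotion'].unique()
# compute the proportions once per song, then split them into one column per emotion
proportions = np.vstack(df_kendrick['lyrics'].apply(emotion_prop).tolist())
for i, emotion in enumerate(emotions):
    df_kendrick[emotion + "_index"] = proportions[:, i]
df_kendrick.head()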

As desired. We are ready to begin visualizing Kendrick's discography!

Visualization¶

First I wanted to see which lyrical and musical features dominate, on average, in each Kendrick album. To do so, I decided to produce an interactive radar plot with all albums overlaid on top of one another. This plot might look messy at first, but one can select (or deselect) each album by clicking on its legend entry. In this way, we get a clear yet compact visualization.

In [17]:

lyrical_variables = nrc['emotion'].unique()+"_index"
musical_variables =  ["danceability", "acousticness", "energy", "instrumentalness", "liveness","speechiness", "valence"]

In [18]:

def radar_discography_plot(df, x, musical, export=False):
    # Radar plot of the per-album mean of each feature; x sets the horizontal position of the title
    artist = df['artist'][0]
    df = df[::-1]
    # numeric_only=True keeps the text columns (name, lyrics, ...) out of the average
    df = df.groupby('album', as_index=False, sort=False).mean(numeric_only=True)
    df = df.drop(['length','popularity','loudness','tempo','time_signature'], axis = 1)
    df_long = pd.melt(df, id_vars='album', value_vars=df.columns.values[1:])
    # musical features live in [0, 1]; the lyrical proportions are much smaller
    kind, range_r = ("Musical", [0,1]) if musical else ("Lyrical", [0,0.3])
    fig = px.line_polar(df_long, r="value", theta="variable", color="album", line_close=True, range_r=range_r,
                color_discrete_sequence=px.colors.qualitative.Bold)
    fig.update_layout(title = {'text': kind + " Features of " + artist + "'s Discography",'y':0.98,'x':x,'xanchor': 'center','yanchor': 'top'})
    fig.show()
    if export:
        fig.write_html(artist+".html")

Let's visualize first the musical and then the lyrical features of each of his albums.

In [19]:

radar_discography_plot(df_kendrick.drop(lyrical_variables,axis=1).copy(),0.45, musical = True)
radar_discography_plot(df_kendrick.drop(musical_variables,axis=1).copy(),0.45, musical = False)

Still, while this visualization is effective at showing which features dominate in one particular album, it is not as effective for comparing one feature across his discography. For that purpose a bar chart works better:

In [20]:

def bar_discography_plot(df, x, musical, export=False):
    # Grouped bar chart of the per-album mean of each feature; x sets the horizontal position of the title
    artist = df['artist'][0]
    df = df[::-1]
    # numeric_only=True keeps the text columns (name, lyrics, ...) out of the average
    df = df.groupby('album', as_index=False, sort=False).mean(numeric_only=True)
    df = df.drop(['length','popularity','loudness','tempo','time_signature'], axis = 1)
    df_long = pd.melt(df, id_vars='album', value_vars=df.columns.values[1:])
    # musical features live in [0, 1]; the lyrical proportions are much smaller
    kind, range_y = ("Musical", [0,1]) if musical else ("Lyrical", [0,0.3])
    fig = px.bar(df_long, x="variable", y="value", color='album', barmode='group', height=400, range_y=range_y)
    fig.update_layout(title = {'text': kind + " Features of " + artist + "'s Discography",'y':0.95,'x':x,'xanchor': 'center','yanchor': 'top'})
    fig.show()
    if export:
        fig.write_html(artist+".html")

In [21]:

bar_discography_plot(df_kendrick.drop(lyrical_variables,axis=1).copy(),x=0.45,musical=True)
bar_discography_plot(df_kendrick.drop(musical_variables,axis=1).copy(),0.45, musical = False)

Overall, I would argue that these visualizations adequately represent Kendrick's music. For example, untitled unmastered stands out for its live instrumentation and somewhat subdued tone. On the other hand, Kendrick is known for his intertwined storytelling: To Pimp a Butterfly references good kid, m.A.A.d city, while DAMN. references its two predecessors. It is therefore to be expected that his lyrics have fairly similar features across his discography. One final point of interest is the predominance of both negative and positive lyrics. While seemingly contradictory, duality has been a staple of Kendrick's music. For example, consider u and i from To Pimp a Butterfly: members of the Genius community have noticed the duality between both tracks: "u acts as a complete contrast to its lead single i, an anthem of peace, positivity, and prosperity starting with self-love".

I hope this first part of the project incentivizes the reader to explore the discography of their favorite artist!

PART II: Predicting musical success.¶

While it was fun to look at Kendrick's discography, such data will not allow me to answer the main question that drives this research: what are the determinants of popularity in music? This is because the Spotify API does not expose the number of times a song has been streamed. The closest feature would be a track's popularity; however, there are two issues: the algorithm behind this variable is not known, and the variable looks like a better indicator of current popularity (i.e. what is trending) than of all-time success. Thus I decided to use a dataset from Wikipedia with information about the 100 most-streamed songs of all time. I will once again use similar methods to obtain both musical and lyrical features for each song.

In [22]:

df_wiki = pd.read_html("https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify#100_most-streamed_songs")[0]
df_wiki = df_wiki.drop(100)
df_wiki = df_wiki.drop('Rank',axis=1)
df_wiki.columns = ['name', 'artist','album','streams','release_date']
df_wiki

Out[22]:

	name	artist	album	streams	release_date
0	"Shape of You"	Ed Sheeran	÷	2476	6 January 2017
1	"Rockstar"	Post Malone featuring 21 Savage	Beerbongs & Bentleys	1882	15 September 2017
2	"One Dance"	Drake featuring Wizkid and Kyla	Views	1840	5 April 2016
3	"Closer"	The Chainsmokers featuring Halsey	Collage	1757	29 July 2016
4	"Thinking Out Loud"	Ed Sheeran	×	1521	20 June 2014
...	...	...	...	...	...
95	"Moonlight"	XXXTentacion	?	951	14 August 2018
96	"Work"	Rihanna featuring Drake	Anti	949	27 January 2016
97	"Lovely"	Billie Eilish and Khalid	13 Reasons Why: Season 2 (A Netflix Original S...	952	19 April 2018
98	"There's Nothing Holdin' Me Back"	Shawn Mendes	Illuminate	945	20 April 2017
99	"Me Rehúso"	Danny Ocean	54+1	938	16 September 2016

100 rows × 5 columns

First, I standardize the data so that we can look up each song's musical features on Spotify. I write a few helper functions for that task.

In [23]:

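def fix_name(string):
    # drop the double quotes that Wikipedia puts around song titles
    return string.replace('"', '')

def fix_featuring(string):
    # keep only the lead artist: "A featuring B" -> "A"
    return string.split(' featuring', 1)[0]

def fix_and(string):
    # keep only the first artist: "A and B" -> "A"
    return string.split(' and', 1)[0]

def fix_apostrophe(string):
    # remove apostrophes from song titles
    return string.replace("'", "")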
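
To make the effect of these helpers concrete, here is a small self-contained sketch that applies the same string operations to a few rows copied from the Wikipedia table above. The wrappers `clean_title` and `clean_artist` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
def clean_title(title):
    # same operations as fix_name followed by fix_apostrophe
    return title.replace('"', '').replace("'", "")

def clean_artist(artist):
    # same operations as fix_featuring followed by fix_and
    return artist.split(' featuring', 1)[0].split(' and', 1)[0]

samples = [
    ('"Rockstar"', 'Post Malone featuring 21 Savage'),
    ('"Lovely"', 'Billie Eilish and Khalid'),
    ('"There\'s Nothing Holdin\' Me Back"', 'Shawn Mendes'),
]
for title, artist in samples:
    print(clean_title(title), '|', clean_artist(artist))

# Caveat: splitting on ' and' also truncates names that contain the word itself
print(clean_artist('Tones and I'))
```

The last call shows a side effect worth keeping in mind: an artist name such as "Tones and I" is shortened to "Tones", so the Spotify query is looser than intended for that row.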

We are ready to standardize the data.

In [24]:

df_wiki['name'] = df_wiki['name'].apply(fix_name)
df_wiki['name'] = df_wiki['name'].apply(fix_apostrophe)
df_wiki['artist'] = df_wiki['artist'].apply(fix_featuring)
df_wiki['artist'] = df_wiki['artist'].apply(fix_and)
df_wiki.at[78, 'name'] = "I Don't Wanna Live Forever"

Now, we obtain the IDs for every track.

In [25]:

ids = []
for i in range(len(df_wiki['artist'])):
    try:
        artist = df_wiki['artist'][i]
        track = df_wiki['name'][i]
        track_id = sp.search(q='artist:' + artist + ' track:' + track, type='track')
        ids.append(track_id['tracks']['items'][0]['id'])
    except Exception as err:
        # report which row failed; a skipped row would misalign ids with df_wiki
        print(i, err)
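
In the loop above, a failed search only prints the row index, so `ids` would end up shorter than `df_wiki` and the later `df_top['streams'] = df_wiki['streams']` assignment would pair streams with the wrong songs. The sketch below shows one way to keep the two aligned by storing `None` for misses. It is only an illustration: `collect_track_ids` and `fake_search` are hypothetical, and `fake_search` merely imitates the nested `['tracks']['items'][0]['id']` shape that the loop reads from `sp.search`.

```python
def collect_track_ids(rows, search):
    """Return one entry per (artist, track) pair; None marks a failed lookup."""
    ids = []
    for artist, track in rows:
        try:
            response = search(q=f"artist:{artist} track:{track}", type="track")
            ids.append(response["tracks"]["items"][0]["id"])
        except (IndexError, KeyError):
            # keep a placeholder so positions still match the input rows
            ids.append(None)
    return ids

def fake_search(q, type):
    # stand-in for the real client, with a one-song catalog
    catalog = {"artist:Ed Sheeran track:Shape of You": "id-001"}
    items = [{"id": catalog[q]}] if q in catalog else []
    return {"tracks": {"items": items}}

rows = [("Ed Sheeran", "Shape of You"), ("Nobody", "Missing Song")]
ids = collect_track_ids(rows, fake_search)
print(ids)
```

With placeholders in place, rows without an ID can be inspected and fixed by hand before the features are requested.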

Finally, we create the dataframe.

In [26]:

features = get_tracks_features(ids)
tracks_features_to_csv(features, "top100")
df_top = pd.read_csv("top100.csv",index_col=[0])
df_top['streams'] = df_wiki['streams']
df_top.head()

Out[26]:

	name	album	artist	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	valence	tempo	time_signature	streams
0	Shape of You	÷ (Deluxe)	Ed Sheeran	2017-03-03	233712	86	0.825	0.58100	0.652	0.00000	0.0931	-3.183	0.0802	0.931	95.977	4	2476
1	rockstar (feat. 21 Savage)	beerbongs & bentleys	Post Malone	2018-04-27	218146	88	0.585	0.12400	0.520	0.00007	0.1310	-6.136	0.0712	0.129	159.801	4	1882
2	One Dance	Views	Drake	2016-05-06	173986	82	0.792	0.00776	0.625	0.00188	0.3290	-5.609	0.0536	0.370	103.967	4	1840
3	Closer	Closer	The Chainsmokers	2016-07-29	244960	85	0.748	0.41400	0.524	0.00000	0.1110	-5.599	0.0338	0.661	95.010	4	1757
4	Thinking out Loud	x (Deluxe Edition)	Ed Sheeran	2014-06-21	281560	83	0.781	0.47400	0.445	0.00000	0.1840	-6.061	0.0295	0.591	78.998	4	1521

We now have the musical features along with the number of streams. Now, let's add the lyrics! Before adding them, however, it is better to standardize song names across the current dataframe and the Genius API, since the wrapper for the latter is unreliable. This way, we minimize the chance of error.

In [27]:

def fix_featuring2(string):
    sep = ' (feat'
    rest = string.split(sep, 1)[0]
    return rest

def fix_with(string):
    sep = ' (with'
    rest = string.split(sep, 1)[0]
    return rest

In [28]:

df_top['name'] = df_top['name'].apply(fix_featuring2)
df_top['name'] = df_top['name'].apply(fix_with)

In [29]:

df_top.at[6, 'name'] = "Sunflower"
df_top.at[6, 'artist'] = "Post Malone"
df_top.at[20, 'artist'] = "Justin Bieber"
df_top.at[35, 'name'] = "Bohemian Rhapsody"
df_top.at[35, 'artist'] = "Queen"
df_top.at[49, 'artist'] = "Justin Bieber"
df_top.at[54, 'name'] = "Can't Stop the Feeling"
df_top.at[54, 'artist'] = "Justin Timberlake"
df_top.at[78, 'name'] = "I Don't Wanna Live Forever"
df_top.at[86, 'artist'] = "J Balvin"
# some additional fixes
df_top.at[65, 'release_date'] = '2015-03-10'
df_top.at[20, 'release_date'] = '2015-10-22'

In [30]:

lyrics = []
for i in range(len(df_top['name'])):
    lyrics.append(genius.search_song(df_top['name'][i], df_top['artist'][i]))

Searching for "Shape of You" by Ed Sheeran...
Done.
Searching for "rockstar" by Post Malone...
Done.
Searching for "One Dance" by Drake...
Done.
Searching for "Closer" by The Chainsmokers...
Done.
Searching for "Thinking out Loud" by Ed Sheeran...
Done.
Searching for "God's Plan" by Drake...
Done.
Searching for "Sunflower" by Post Malone...
Done.
Searching for "Dance Monkey" by Tones And I...
Done.
Searching for "Havana" by Camila Cabello...
Done.
Searching for "Perfect" by Ed Sheeran...
Done.
Searching for "Say You Won't Let Go" by James Arthur...
Done.
Searching for "Love Yourself" by Justin Bieber...
Done.
Searching for "Señorita" by Shawn Mendes...
Done.
Searching for "Photograph" by Ed Sheeran...
Done.
Searching for "Lean On" by Major Lazer...
Done.
Searching for "Despacito - Remix" by Luis Fonsi...
Done.
Searching for "Believer" by Imagine Dragons...
Done.
Searching for "Starboy" by The Weeknd...
Done.
Searching for "New Rules" by Dua Lipa...
Done.
Searching for "bad guy" by Billie Eilish...
Done.
Searching for "Sorry" by Justin Bieber...
Done.
Searching for "Don't Let Me Down" by The Chainsmokers...
Done.
Searching for "Something Just Like This" by The Chainsmokers...
Done.
Searching for "Thunder" by Imagine Dragons...
Done.
Searching for "SAD!" by XXXTENTACION...
Done.
Searching for "I Took A Pill In Ibiza - Seeb Remix" by Mike Posner...
Done.
Searching for "XO Tour Llif3" by Lil Uzi Vert...
Done.
Searching for "HUMBLE." by Kendrick Lamar...
Done.
Searching for "Let Me Love You" by DJ Snake...
Done.
Searching for "Faded" by Alan Walker...
Done.
Searching for "Better Now" by Post Malone...
Done.
Searching for "Lucid Dreams" by Juice WRLD...
Done.
Searching for "Stressed Out" by Twenty One Pilots...
Done.
Searching for "Congratulations" by Post Malone...
Done.
Searching for "All of Me" by John Legend...
Done.
Searching for "Bohemian Rhapsody" by Queen...
Done.
Searching for "Someone You Loved" by Lewis Capaldi...
Done.
Searching for "Happier" by Marshmello...
Done.
Searching for "Treat You Better" by Shawn Mendes...
Done.
Searching for "Take Me to Church" by Hozier...
Done.
Searching for "Unforgettable" by French Montana...
Done.
Searching for "Cheap Thrills" by Sia...
Done.
Searching for "Uptown Funk" by Mark Ronson...
Done.
Searching for "Stay With Me" by Sam Smith...
Done.
Searching for "7 rings" by Ariana Grande...
Done.
Searching for "Let Her Go" by Passenger...
Done.
Searching for "Shallow" by Lady Gaga...
Done.
Searching for "Despacito - Remix" by Luis Fonsi...
Done.
Searching for "Cold Water" by Major Lazer...
Done.
Searching for "What Do You Mean?" by Justin Bieber...
Done.
Searching for "Jocelyn Flores" by XXXTENTACION...
Done.
Searching for "7 Years" by Lukas Graham...
Done.
Searching for "Girls Like You" by Maroon 5...
Done.
Searching for "thank u, next" by Ariana Grande...
Done.
Searching for "Can't Stop the Feeling" by Justin Timberlake...
Done.
Searching for "Too Good At Goodbyes" by Sam Smith...
Done.
Searching for "Stitches" by Shawn Mendes...
Done.
Searching for "That's What I Like" by Bruno Mars...
Done.
Searching for "Cheerleader - Felix Jaehn Remix Radio Edit" by OMI...
Done.
Searching for "Wake Me Up" by Avicii...
Done.
Searching for "I Like It" by Cardi B...
Done.
Searching for "Psycho" by Post Malone...
Done.
Searching for "This Is What You Came For" by Calvin Harris...
Done.
Searching for "Heathens" by Twenty One Pilots...
Done.
Searching for "Can't Feel My Face" by The Weeknd...
Done.
Searching for "See You Again" by Wiz Khalifa...
Done.
Searching for "Hello" by Adele...
Done.
Searching for "Radioactive" by Imagine Dragons...
Done.
Searching for "I Fall Apart" by Post Malone...
Done.
Searching for "Work from Home" by Fifth Harmony...
Done.
Searching for "Attention" by Charlie Puth...
Done.
Searching for "Counting Stars" by OneRepublic...
Done.
Searching for "In My Feelings" by Drake...
Done.
Searching for "Without Me" by Halsey...
Done.
Searching for "I Don't Care" by Ed Sheeran...
Done.
Searching for "Can't Hold Us - feat. Ray Dalton" by Macklemore & Ryan Lewis...
Done.
Searching for "One Kiss" by Calvin Harris...
Done.
Searching for "SICKO MODE" by Travis Scott...
Done.
Searching for "I Don't Wanna Live Forever" by Taylor Swift...
Done.
Searching for "Chandelier" by Sia...
Done.
Searching for "Rockabye" by Clean Bandit...
Done.
Searching for "Taki Taki" by DJ Snake...
Done.
Searching for "The Hills" by The Weeknd...
Done.
Searching for "Eastside" by benny blanco...
Done.
Searching for "Sugar" by Maroon 5...
Done.
Searching for "We Don't Talk Anymore" by Charlie Puth...
Done.
Searching for "Mi Gente" by J Balvin...
Done.
Searching for "IDGAF" by Dua Lipa...
Done.
Searching for "I'm the One" by DJ Khaled...
Done.
Searching for "I Like Me Better" by Lauv...
Done.
Searching for "Demons" by Imagine Dragons...
Done.
Searching for "It Ain’t Me" by Kygo...
Done.
Searching for "Riptide" by Vance Joy...
Done.
Searching for "Ride" by Twenty One Pilots...
Done.
Searching for "Old Town Road - Remix" by Lil Nas X...
Done.
Searching for "Moonlight" by XXXTENTACION...
Done.
Searching for "Work" by Rihanna...
Done.
Searching for "lovely" by Billie Eilish...
Done.
Searching for "There's Nothing Holdin' Me Back" by Shawn Mendes...
Done.
Searching for "Me Rehúso" by Danny Ocean...
Done.

There are still some errors despite the standardization. We solve these by detecting the null values and repeating the search for those songs.

In [31]:

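# retry failed lookups, giving up after a few passes to avoid looping forever
for attempt in range(5):
    none_index = [i for i, lyric in enumerate(lyrics) if lyric is None]
    if not none_index:
        break
    for i in none_index:
        lyrics[i] = genius.search_song(df_top['name'][i], df_top['artist'][i])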

Next, we add the lyrics to the dataframe so that we can compute their emotional features.

In [32]:

lyrics_text = []
for lyric in lyrics:
    lyrics_text.append(lyric.lyrics.lower())
df_top['lyrics'] = lyrics_text

Now, we use the NRC Lexicon to obtain the lyrical features of each song:

In [33]:

for i in range(len(nrc['emotion'].unique())):
    df_top[nrc['emotion'].unique()[i]+"_index"] = df_top.apply(lambda row: emotion_prop(row['lyrics'])[i], axis=1)
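
The cell above relies on `nrc` and `emotion_prop`, both defined earlier in the notebook. As a rough illustration of what a lexicon-based emotion index does, here is a toy sketch. Everything in it is an assumption made for the example: the three-word `TOY_LEXICON` is not the NRC Emotion Lexicon, and dividing by the total word count is only one possible normalization, which may differ from the one `emotion_prop` uses.

```python
# Hypothetical mini-lexicon; the real NRC Emotion Lexicon is far larger
TOY_LEXICON = {
    "love": {"joy", "positive"},
    "cry": {"sadness", "negative"},
    "fight": {"anger", "fear", "negative"},
}
EMOTIONS = ["anger", "fear", "joy", "negative", "positive", "sadness"]

def toy_emotion_prop(text):
    """Share of words in `text` tagged with each emotion (assumed normalization)."""
    words = text.lower().split()
    counts = dict.fromkeys(EMOTIONS, 0)
    for word in words:
        for emotion in TOY_LEXICON.get(word, ()):
            counts[emotion] += 1
    total = max(len(words), 1)  # avoid dividing by zero on empty lyrics
    return [counts[e] / total for e in EMOTIONS]

songs = {"song_a": "love love cry", "song_b": "fight cry now again"}
for name, text in songs.items():
    props = toy_emotion_prop(text)  # computed once per song
    row = {e + "_index": round(p, 3) for e, p in zip(EMOTIONS, props)}
    print(name, row)
```

Note that the sketch scores each song once and then spreads the result over the index columns, whereas the cell above calls `emotion_prop` again for every emotion; computing the proportions once per row would give the same columns with less work.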

Next, we convert the release date into a date type and compute the number of days since release as a measure of the song's longevity. First, we create a few helper functions.

In [34]:

def string_to_date(string):
    return date(*map(int, string.split('-')))

df_top['release_date'] = df_top['release_date'].apply(string_to_date)

In [35]:

def days_since_release(release_date):
    # the parameter is no longer called `date`, so it does not shadow datetime.date
    return (date.today() - release_date).days

df_top['days_since_release'] = df_top['release_date'].apply(days_since_release)
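
Because the helper above uses `date.today()`, the longevity feature changes every time the notebook is rerun. A minimal sketch of a reproducible variant is shown below, assuming a fixed reference date is acceptable. The function `days_between` is a hypothetical name, and the reference date is simply the run date printed in the OLS summary further down.

```python
from datetime import date

def days_between(release, reference):
    """Whole days from an ISO 'YYYY-MM-DD' release string to a reference date."""
    return (reference - date.fromisoformat(release)).days

reference = date(2020, 4, 20)  # run date shown in the regression output
for release in ["2017-03-03", "2016-07-29"]:
    print(release, days_between(release, reference))
```

Since the feature is min-max scaled afterwards, shifting the reference date does not change the scaled values, as long as the same reference is used for every song.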

Finally, we rescale a few variables to the [0, 1] interval in order to reduce the condition number of the linear regression.

In [36]:

df_top['streams'] = df_top['streams'].astype(float)
# min-max scale each of these variables to the [0, 1] interval
for col in ['loudness', 'length', 'popularity', 'days_since_release', 'tempo']:
    df_top[col] = (df_top[col]-df_top[col].min())/(df_top[col].max()-df_top[col].min())
df_top['streams_scaled'] = (df_top['streams']-df_top['streams'].min())/(df_top['streams'].max()-df_top['streams'].min())
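
The transformation applied in this cell is min-max scaling. As a small worked example, the sketch below applies it to the five loudness values shown in the `df_top.head()` output. The helper `min_max_scale` is a hypothetical name, and returning zeros for a constant column is a design choice of this sketch (the pandas expressions above would produce NaN in that case).

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

loudness = [-3.183, -6.136, -5.609, -5.599, -6.061]  # first five rows of df_top
print([round(v, 3) for v in min_max_scale(loudness)])
```

The quietest track maps to 0 and the loudest to 1, which puts loudness on the same scale as the Spotify features that already live in [0, 1].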

In [37]:

#we drop non-relevant variables and anticipation_index due to multicollinearity
X = df_top.drop(["streams_scaled","name", "album", "artist","release_date",
                 "popularity","time_signature","streams","lyrics","anticipation_index"], axis=1).copy()
X2 = sm.add_constant(X)
y = df_top['streams_scaled']
lr_model = linear_model.LinearRegression()
lr_model.fit(X, y)
print(sm.OLS(y, X2).fit().summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         streams_scaled   R-squared:                       0.239
Model:                            OLS   Adj. R-squared:                  0.047
Method:                 Least Squares   F-statistic:                     1.244
Date:                Mon, 20 Apr 2020   Prob (F-statistic):              0.243
Time:                        18:04:44   Log-Likelihood:                 60.175
No. Observations:                 100   AIC:                            -78.35
Df Residuals:                      79   BIC:                            -23.64
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.1094      0.531     -0.206      0.837      -1.166       0.947
length                -0.0853      0.128     -0.664      0.508      -0.341       0.170
danceability           0.0269      0.152      0.176      0.860      -0.276       0.330
acousticness          -0.0050      0.085     -0.059      0.953      -0.174       0.164
energy                -0.4499      0.189     -2.382      0.020      -0.826      -0.074
instrumentalness       1.4024      0.826      1.697      0.094      -0.242       3.047
liveness               0.3604      0.172      2.099      0.039       0.019       0.702
loudness               0.2562      0.126      2.040      0.045       0.006       0.506
speechiness            0.1976      0.221      0.893      0.375      -0.243       0.638
valence                0.0854      0.102      0.840      0.403      -0.117       0.288
tempo                  0.0330      0.076      0.433      0.666      -0.119       0.185
anger_index            0.3789      0.855      0.443      0.659      -1.322       2.080
disgust_index         -0.0465      0.728     -0.064      0.949      -1.495       1.402
fear_index            -0.4628      0.705     -0.656      0.513      -1.866       0.940
joy_index              0.9893      0.770      1.284      0.203      -0.544       2.523
negative_index        -0.1660      0.618     -0.269      0.789      -1.396       1.064
positive_index         0.3695      0.589      0.627      0.533      -0.804       1.543
sadness_index          1.1241      0.686      1.639      0.105      -0.241       2.489
surprise_index        -0.0660      1.087     -0.061      0.952      -2.230       2.098
trust_index            0.3652      0.691      0.529      0.599      -1.010       1.741
days_since_release     0.1396      0.192      0.729      0.468      -0.242       0.521
==============================================================================
Omnibus:                       55.827   Durbin-Watson:                   0.543
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              227.537
Skew:                           1.857   Prob(JB):                     3.90e-50
Kurtosis:                       9.389   Cond. No.                         215.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We are ready to go! Let's visualize the relationship between the musical and lyrical features of a song and its popularity.

In [38]:

columns_music = (df_top.columns)[6:14]
fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=columns_music[2*i+j].capitalize(),row=i+1, col=j+1)
fig.update_layout(height=2000, width=900, title= {'text': "Musical Features", 'x':0.5},showlegend=False)
fig.update_traces(marker_size=4)
fig.show()

Let's try to analyze lyrical features next!

In [39]:

columns_emotion = (df_top.columns)[18:]
fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                  mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=(columns_emotion[2*i+j].capitalize()).replace("_"," "),row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title= {'text': "Lyrical Features", 'x':0.5}, showlegend = False)
fig.update_traces(marker_size=4)
fig.show()

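The fitted values follow the data only loosely: the summary above reports an R-squared of 0.239 (adjusted: 0.047) and a p-value of 0.243 for the F-statistic, so the regressors are not jointly significant at conventional levels. I am also a bit worried that some variables are not bringing anything to the table. I will try to refine the model in the next section.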
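
To see how much of the reported fit is due to the sheer number of regressors, the adjusted R-squared can be recomputed from the figures in the OLS summary (R-squared 0.239, 100 observations, 20 regressors plus a constant). The short sketch below does this with the standard formula; `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# figures taken from the OLS summary above
print(round(adjusted_r2(0.239, 100, 20), 3))
```

The result agrees with the 0.047 in the summary up to the rounding of the printed R-squared, and shows that most of the raw R-squared disappears once the 20 degrees of freedom are accounted for.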

Refining the model¶

Let's implement a Lasso regression and compare its MSE to our previous linear model.

In [40]:

lasso_model = linear_model.Lasso()
lasso_model.fit(X, y)
print("Linear Regression MSE:", metrics.mean_squared_error(y, lr_model.predict(X)))
print("Lasso Regression MSE:", metrics.mean_squared_error(y, lasso_model.predict(X)))

Linear Regression MSE: 0.017573399462309323
Lasso Regression MSE: 0.023106569591163435

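Our linear model outperforms our Lasso model here. Keep in mind that this is an in-sample comparison: OLS minimizes the MSE on the data it was fit to, so a penalized linear model cannot do better by this measure. Let's visualize their different predictions as before.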

In [41]:

fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate = "<b>Lasso model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=(columns_music[2*i+j].capitalize()).replace("_"," "),row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title= {'text': "Musical Features", 'x':0.5}, showlegend = False)
fig.update_traces(marker_size=4)
fig.show()

In [42]:

fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                 mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate = "<b>Lasso model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=(columns_emotion[2*i+j].capitalize()).replace("_"," "),row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title= {'text': "Lyrical Features", 'x':0.5}, showlegend = False)
fig.update_traces(marker_size=4)
fig.show()

Something fishy is going on with the Lasso regression... It looks like a horizontal line! Let's extract the coefficients of both models and compare them.

In [43]:

lasso_model = linear_model.Lasso()
lasso_model.fit(X, y)
lasso_coefs = pd.Series(dict(zip(list(X), lasso_model.coef_)))
lr_coefs = pd.Series(dict(zip(list(X), lr_model.coef_)))
coefs = pd.DataFrame(dict(lasso=lasso_coefs, linreg=lr_coefs))
print(coefs)

                    lasso    linreg
length                0.0 -0.085312
danceability          0.0  0.026892
acousticness          0.0 -0.005017
energy               -0.0 -0.449886
instrumentalness      0.0  1.402413
liveness              0.0  0.360372
loudness              0.0  0.256238
speechiness           0.0  0.197584
valence               0.0  0.085369
tempo                -0.0  0.032987
anger_index          -0.0  0.378939
disgust_index        -0.0 -0.046521
fear_index           -0.0 -0.462835
joy_index             0.0  0.989285
negative_index       -0.0 -0.165996
positive_index        0.0  0.369506
sadness_index        -0.0  1.124084
surprise_index        0.0 -0.065989
trust_index           0.0  0.365189
days_since_release   -0.0  0.139555
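
A likely explanation for the all-zero Lasso column above, sketched under assumptions: scikit-learn's `Lasso` minimizes (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 and defaults to `alpha=1.0`. Every coefficient is exactly zero once alpha reaches the largest absolute covariance between a centered regressor and the centered target. If all variables lie in [0, 1] (true for the rescaled columns and the target; assumed here for the emotion indices), that covariance cannot exceed 0.25, so the default penalty is always large enough. The helper `lasso_alpha_max` and the toy data below are hypothetical and only illustrate the threshold.

```python
import random
from statistics import fmean

def lasso_alpha_max(X, y):
    """Smallest alpha at which every Lasso coefficient is zero (intercept fitted)."""
    n = len(y)
    y_mean = fmean(y)
    alpha_max = 0.0
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        col_mean = fmean(col)
        inner = sum((x - col_mean) * (t - y_mean) for x, t in zip(col, y))
        alpha_max = max(alpha_max, abs(inner) / n)
    return alpha_max

# toy data with every variable inside [0, 1]; not the notebook's dataset
random.seed(0)
X_toy = [[random.random() for _ in range(3)] for _ in range(100)]
y_toy = [0.5 * row[0] + 0.2 * random.random() for row in X_toy]

print("alpha_max:", lasso_alpha_max(X_toy, y_toy))
print("default alpha of 1.0 zeroes everything:", lasso_alpha_max(X_toy, y_toy) < 1.0)
```

Under these assumptions, a meaningful Lasso fit needs a much smaller penalty than the default, for instance one chosen by cross-validation with `linear_model.LassoCV`.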

Our suspicions are confirmed: all Lasso coefficients are 0. That's a pretty weird result at first glance. It means that, with the default penalty, our Lasso regression reduces to the simplest of estimators: the sample mean of the dependent variable. Let's split the data into training and testing sets and compare the MSEs to check whether our linear model still beats the Lasso model.

In [44]:

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

def fit_and_report_mses(mod, X_train, X_test, y_train, y_test):
    mod.fit(X_train, y_train)
    return dict(
        mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
        mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test))
    )

print("Linear Model:", fit_and_report_mses(linear_model.LinearRegression(), X_train, X_test, y_train, y_test))
print("Lasso Regression:", fit_and_report_mses(linear_model.Lasso(), X_train, X_test, y_train, y_test))

Linear Model: {'mse_train': 0.011462616788246201, 'mse_test': 0.044234680031768454}
Lasso Regression: {'mse_train': 0.014931725291319514, 'mse_test': 0.04784550215519792}

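On this split, the linear model has the lower MSE on both the training and the test data, but its margin over Lasso (and hence over the naive estimator) on the test set is slim (0.044 versus 0.048), and its test MSE is almost four times its training MSE. Because the split is random and unseeded, the ranking can also change from one run to the next. That's quite a worrying fact, since it means that our variables have very limited predictive capabilities. I will try one last model: a neural network model as implemented in the lecture notes. Let's see if we can beat both other models. First, I use it on the whole dataset.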

In [45]:

# do not forget to scale the features!
nn_scaled_model = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor((30, 20)) 
)

nn_scaled_model.fit(X, y)

print("Linear Regression MSE:", metrics.mean_squared_error(y, lr_model.predict(X)))
print("Lasso Regression MSE:", metrics.mean_squared_error(y, lasso_model.predict(X)))
print("NN Regression MSE:", metrics.mean_squared_error(y, nn_scaled_model.predict(X)))

Linear Regression MSE: 0.017573399462309323
Lasso Regression MSE: 0.023106569591163435
NN Regression MSE: 0.007307496202566639

Cool! In-sample, this model seems to outperform both previous models by a very good margin. Let's visualize the predictions.

In [47]:

fig = make_subplots(rows=4, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(4):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_music[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=lr_model.predict(X),
                                 mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate = "<b>Lasso model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_music[2*i+j]], y=nn_scaled_model.predict(X),
                                 mode="markers",marker=dict(color='Purple'),
                                 hovertemplate = "<b>Neural Network model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_music[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=(columns_music[2*i+j].capitalize()).replace("_"," "),row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title= {'text': "Musical Features", 'x':0.5}, showlegend = False)
fig.update_traces(marker_size=4)
fig.show()

In [48]:

fig = make_subplots(rows=5, cols=2,shared_yaxes=True,
                    vertical_spacing = 0.05,
                    horizontal_spacing = 0.025, 
                    y_title = "Scaled streams" )
for i in range(5):
    for j in range(2):
        fig_trend = px.scatter(df_top, x=columns_emotion[2*i+j], y="streams_scaled",trendline = "ols",hover_name="name")
        fig.add_trace(fig_trend.data[0], row=i+1, col=j+1)
        fig.add_trace(fig_trend.data[1], row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=lr_model.predict(X),
                                 mode="markers",marker=dict(color='Orange'),
                                 hovertemplate = "<b>Full OLS model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=lasso_model.predict(X),
                                 marker=dict(color='Red'),
                                 hovertemplate = "<b>Lasso model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x = df_top[columns_emotion[2*i+j]], y=nn_scaled_model.predict(X),
                                 mode="markers",marker=dict(color='Purple'),
                                 hovertemplate = "<b>Neural Network model prediction</b><br><br>" + 
                                 "streams: %{y}<br>" + columns_emotion[2*i+j] +": %{x}<br>"),
                      row=i+1, col=j+1)
        results = px.get_trendline_results(fig_trend)
        fig.update_xaxes(title_text=(columns_emotion[2*i+j].capitalize()).replace("_"," "),row=i+1, col=j+1)
fig.update_layout(height=2000, width=1000, title= {'text': "Lyrical Features", 'x':0.5}, showlegend = False)
fig.update_traces(marker_size=4)
fig.show()

Finally, let's split the dataset into training and testing parts and hope that our NN model beats the naive estimator.

In [49]:

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

def fit_and_report_mses(mod, X_train, X_test, y_train, y_test):
    mod.fit(X_train, y_train)
    return dict(
        mse_train=metrics.mean_squared_error(y_train, mod.predict(X_train)),
        mse_test=metrics.mean_squared_error(y_test, mod.predict(X_test))
    )

print("Linear Regression:", fit_and_report_mses(linear_model.LinearRegression(), X_train, X_test, y_train, y_test))
print("Lasso Regression:", fit_and_report_mses(linear_model.Lasso(), X_train, X_test, y_train, y_test))
print("NN Model:", fit_and_report_mses(nn_scaled_model, X_train, X_test, y_train, y_test))

Linear Regression: {'mse_train': 0.02048448685532459, 'mse_test': 0.013027925914433454}
Lasso Regression: {'mse_train': 0.0272749774766269, 'mse_test': 0.010809195586301959}
NN Model: {'mse_train': 0.0035198720881701186, 'mse_test': 0.05458352120520873}

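Sadly, none of our models were able to beat the naive estimator in the testing phase. The neural network in particular overfits: its test MSE (0.055) is more than fifteen times its training MSE (0.0035).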
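
One caveat about these comparisons: each one rests on a single unseeded 75/25 split of 100 songs, and the two splits in this notebook already rank the linear model and Lasso differently on the test set. A more stable picture could come from averaging the test MSE over many random splits. The sketch below illustrates the idea with the standard library only. It is not the notebook's pipeline: it uses synthetic data, one regressor, and hypothetical helper names, and it compares the sample-mean predictor with a simple least-squares line.

```python
import random
from statistics import fmean

def mse(y_true, y_pred):
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

def fit_simple_ols(x, y):
    """Least-squares slope and intercept for a single regressor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar

def compare_over_splits(x, y, n_splits=200, test_share=0.25, seed=0):
    """Average test MSE of the sample-mean predictor and of simple OLS."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = int(len(y) * test_share)
    naive_scores, ols_scores = [], []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        x_te, y_te = [x[i] for i in test], [y[i] for i in test]
        slope, intercept = fit_simple_ols(x_tr, y_tr)
        naive_scores.append(mse(y_te, [fmean(y_tr)] * len(y_te)))
        ols_scores.append(mse(y_te, [intercept + slope * v for v in x_te]))
    return fmean(naive_scores), fmean(ols_scores)

# synthetic data in which x carries no information about y
rng = random.Random(1)
x_toy = [rng.random() for _ in range(100)]
y_toy = [rng.random() for _ in range(100)]
naive_mse, ols_mse = compare_over_splits(x_toy, y_toy)
print("naive:", round(naive_mse, 4), "simple OLS:", round(ols_mse, 4))
```

The same averaging could be applied to the three models above, for example with scikit-learn's cross-validation utilities, before drawing conclusions from any single split.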

Concluding remarks¶

I would like to continue refining these models, as most likely there is an issue of overspecification. However, the project is already quite long, so I feel this is a good place to end it. Visually, it seems to me that musical features are not very good indicators of popularity, while lyrical features have stronger predictive capabilities: people tend to like positive and happy music, at least when it comes to its lyrical content. However, the triumph of the naive estimator is quite a disappointing result, so further research on this topic would focus on which regressors should be included in the model and which should be dropped.

Besides these conclusions, my main objective with this project was to lay the groundwork for future data analysis of music. That is, to create a somewhat structured way to gather this data and quantify its features. Still, given that the Spotify API does not give us a way to access the number of times a song has been streamed, alternative ways to measure this variable should be explored. Otherwise, conducting further research will prove difficult.