Loving you is complicated: Quantifying musical sentiment and success through data science

Spotify has quickly become the most popular music streaming service in the world, with over 271 million monthly active users. A close collaborator of Spotify is Genius, a website with 26.5 million monthly users that allows members of its community to upload, annotate, and interpret lyrics from music artists. Inspired by the blog of Thompson Analytics, I will use the Spotify API, the Genius API, and the NRC Emotion Lexicon to quantify both the musical and the lyrical sentiment of one of the most prominent artists of our time: Kendrick Lamar. A visualization of his discography will accompany this process of data collection. In addition, I will track the data of the 100 most-streamed songs of all time and use different regression techniques to determine how useful these features are for predicting musical success.

PART I: Collecting Kendrick Lamar's Discography.

Spotify features

The Spotify API assigns musical features to every track on their platform. These features are defined in the following way:

  • Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

I will be using these features for visualization in the first part and for analysis in the second part.
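To make the thresholds in these definitions concrete, here is a minimal sketch that maps a speechiness score to the three bands quoted above. The function name and the band labels are my own (hypothetical) choices; only the 0.33 and 0.66 cut-offs come from the definition.

```python
def describe_speechiness(value):
    """Map a speechiness score to the three bands from the definition above."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness is expected to lie in [0.0, 1.0]")
    if value > 0.66:
        return "probably entirely spoken words"
    if value >= 0.33:
        return "may contain both music and speech"
    return "most likely music"


# Example scores; the labels are hypothetical, the cut-offs are from the text
for score in (0.05, 0.40, 0.90):
    print(score, "->", describe_speechiness(score))
```

Rap tracks would be expected to land in the middle band fairly often, which is worth keeping in mind when reading the speechiness column later on.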

Getting the Data from Spotify

To get the data from Spotify I will be using Spotipy, a lightweight Python library for the Spotify Web API:

In [1]:
pip install spotipy --upgrade
Requirement already up-to-date: spotipy in /opt/conda/lib/python3.7/site-packages (2.11.2)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (1.14.0)
Requirement already satisfied, skipping upgrade: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from spotipy) (2.23.0)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2019.11.28)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (1.25.7)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (2.9)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->spotipy) (3.0.4)
Note: you may need to restart the kernel to use updated packages.

I will be using Plotly for most visualization tasks since I love how crisp it looks. For the second part of this project I will rely on Scikit-learn to fit the different prediction models. In addition, I need to set the credentials obtained from my Spotify API application to access their data.

In [2]:
# import libraries
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
from datetime import date
import matplotlib.pyplot as plt
from math import pi
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import numpy as np
from sklearn import (linear_model, metrics, neural_network, pipeline, model_selection, preprocessing)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from scipy import stats

# authenticate and connect to the Spotify API
client_id = '547191f3201147df8e76a2aa96607aa3'
client_secret = '43cb18993cf04a53866a3690d1796600'
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
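Hard-coding credentials in a notebook makes them easy to leak. As an alternative, here is a minimal sketch that reads them from environment variables. The helper name `load_spotify_credentials` is hypothetical, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are an assumption on my part (as far as I know, Spotipy looks for the same names by default).

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    # Assumed variable names; adjust them if your setup uses different ones
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running the notebook"
        )
    return client_id, client_secret


# Usage with the objects from the cell above (left commented out, since it needs real keys):
# client_id, client_secret = load_spotify_credentials()
# sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(client_id, client_secret))
```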

Next, I define a series of functions that will allow me to gather data from the Spotify API and put it into a dataframe:

In [3]:
# Defining functions

def get_artist(name):
    # Returns the artist object with respective data
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None

def get_artist_albums_ids(artist):
    # Given an artist object, returns a list with the id of each of the artist's albums
    # (the API is queried once and the response is reused)
    albums = sp.artist_albums(artist['id'], album_type='album', limit=50)['items']
    return [album['id'] for album in albums]

def filter_albums_ids(albums_ids):
    # Spotify has many versions of the same album.
    # This function filters a list of album ids and keeps the most popular version of each album name
    album_names = []
    album_pop = []
    for album_id in albums_ids:
        album = sp.album(album_id)  # one request per album
        album_names.append(album['name'])
        album_pop.append(album['popularity'])
    d = {'id': albums_ids, 'name': album_names, 'popularity': album_pop}
    df = pd.DataFrame(data=d)
    df = df.sort_values('popularity', ascending=False).drop_duplicates('name').sort_index()
    df = df.reset_index(drop=True)
    return df['id']

def get_artist_albums_tracks_ids(albums_ids):
    # Given a list of album ids, returns a list with the ids of the albums' tracks
    tracks_ids = []
    for album_id in albums_ids:
        # one request per album; the response is reused for every track
        for track in sp.album_tracks(album_id)['items']:
            tracks_ids.append(track['id'])
    return tracks_ids

def get_track_features(id):
    # Given a track id, returns a list with its metadata and the musical features described in the introduction
    meta = sp.track(id)
    features = sp.audio_features(id)

    # Meta
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # Features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    valence = features[0]['valence']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']

    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, valence, tempo, time_signature]
    return track

def get_tracks_features(tracks_ids):
    # given a list of ids of tracks, returns a list with their features
    tracks = []
    for i in range(0, len(tracks_ids)):
        track = get_track_features(tracks_ids[i])
        tracks.append(track)
    return(tracks)

def tracks_features_to_csv(tracks_features, csv_title):
    # transforms the list of track features into a csv file
    df = pd.DataFrame(tracks_features, columns = ['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'valence','tempo', 'time_signature'])
    df.to_csv(csv_title + ".csv", sep = ',')
    
def get_discography(artist_name):
    # a function that returns the discography of an artist given its name (string)
    # do not use, it is quite unreliable since the discography is not filtered
    tracks_features = get_tracks_features(get_artist_albums_tracks_ids(filter_albums_ids(get_artist_albums_ids(get_artist(artist_name)))))
    return tracks_features_to_csv(tracks_features, artist_name)

def get_playlist_track_ids(user, playlist_id):
    # initially I wanted to use a playlist to get an artist's discography
    # however, finding a good playlist proved to be quite difficult
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids

Done! Let's see the functions in action.

In [4]:
# first, we find the artist object
kendrick = get_artist("Kendrick Lamar")
# we look for his albums
albums_ids = get_artist_albums_ids(kendrick)
# we filter this list since spotify has many versions of the same album. We keep the most popular one
filtered_albums_ids = filter_albums_ids(albums_ids)
# we get the id of every track in his discography
tracks_ids = get_artist_albums_tracks_ids(filtered_albums_ids)
# we get the features of every track
tracks_features = get_tracks_features(tracks_ids)
# we export the data to csv format
tracks_features_to_csv(tracks_features, "Kendrick Lamar")
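The pipeline above makes separate requests for every single track, which is slow and gives the API more chances to fail. As far as I understand the Web API, the audio-features endpoint accepts several ids per call (up to 100), so the requests could be batched. Below is a minimal sketch of that idea; `chunked` and `get_tracks_audio_features_batched` are hypothetical helpers, and `client` stands for any object exposing `audio_features(list_of_ids)`, such as the `sp` object defined earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_tracks_audio_features_batched(client, tracks_ids, batch_size=100):
    """Collect audio features with one request per batch instead of one per track."""
    features = []
    for batch in chunked(list(tracks_ids), batch_size):
        # Assumption: the client returns one feature dict per id, in the same order
        features.extend(client.audio_features(batch))
    return features


# Usage with the Spotipy client from above (commented out, since it needs credentials):
# all_features = get_tracks_audio_features_batched(sp, tracks_ids)
```

The track metadata (name, album, popularity) would still need its own requests, so this only replaces the `sp.audio_features(id)` call inside `get_track_features`.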

The API tends to be quite unreliable. At times the process yields an error, so the previous step might require more than one attempt. In my case I got the data on the second try. Let's see how it looks:

In [5]:
df_kendrick = pd.read_csv("Kendrick Lamar.csv",index_col=[0])
print(df_kendrick['album'].unique())
df_kendrick
['Black Panther The Album Music From And Inspired By'
 'DAMN. COLLECTORS EDITION.' 'DAMN.' 'untitled unmastered.'
 'To Pimp A Butterfly' 'good kid, m.A.A.d city (Deluxe)'
 'good kid, m.A.A.d city' 'Section.80' 'Overly Dedicated']
Out[5]:
name album artist release_date length popularity danceability acousticness energy instrumentalness liveness loudness speechiness valence tempo time_signature
0 Black Panther Black Panther The Album Music From And Inspire... Kendrick Lamar 2018-02-09 130613 58 0.618 0.6250 0.582 0.000004 0.2650 -9.454 0.2970 0.480 90.035 4
1 All The Stars (with SZA) Black Panther The Album Music From And Inspire... Kendrick Lamar 2018-02-09 232186 79 0.698 0.0605 0.633 0.000194 0.0926 -4.946 0.0597 0.552 96.924 4
2 X (with 2 Chainz & Saudi) Black Panther The Album Music From And Inspire... Kendrick Lamar 2018-02-09 267426 70 0.768 0.0201 0.471 0.000000 0.2680 -8.406 0.2590 0.405 131.023 4
3 The Ways (with Swae Lee) Black Panther The Album Music From And Inspire... Kendrick Lamar 2018-02-09 238893 66 0.727 0.0626 0.720 0.000001 0.1760 -5.856 0.0488 0.589 140.080 4
4 Opps (with Yugen Blakrok) Black Panther The Album Music From And Inspire... Kendrick Lamar 2018-02-09 180893 60 0.706 0.1520 0.775 0.000033 0.4160 -6.819 0.3350 0.847 127.929 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
118 Barbed Wire Overly Dedicated Kendrick Lamar 2010-09-14 265678 44 0.613 0.0152 0.843 0.000000 0.0917 -8.343 0.1860 0.325 102.968 4
119 Average Joe Overly Dedicated Kendrick Lamar 2010-09-14 256048 46 0.740 0.2430 0.733 0.000000 0.1840 -3.343 0.2480 0.218 91.603 4
120 H.O.C Overly Dedicated Kendrick Lamar 2010-09-14 316975 44 0.613 0.1050 0.591 0.000000 0.1910 -8.580 0.3910 0.371 77.124 4
121 Cut You Off (To Grow Closer) Overly Dedicated Kendrick Lamar 2010-09-14 364103 47 0.685 0.0770 0.681 0.000000 0.1290 -7.176 0.4810 0.614 82.982 4
122 She Needs Me (Remix) Overly Dedicated Kendrick Lamar 2010-09-14 195790 53 0.606 0.4170 0.835 0.000001 0.1030 -6.107 0.2650 0.313 100.145 4

123 rows × 16 columns

Looking good. However, notice that we still have some repeated observations: good kid, m.A.A.d city appears in both its standard and deluxe editions, and DAMN. appears in both its standard and collector's editions. For popularity reasons, I will keep the deluxe edition of good kid, m.A.A.d city and drop the collector's edition of DAMN. Moreover, I will drop the Black Panther soundtrack since I believe it does not stand as a singular effort by Kendrick, but rather as a collective effort that seeks to accompany the movie.

In [6]:
df_kendrick = df_kendrick[df_kendrick['album'] != 'good kid, m.A.A.d city']
df_kendrick = df_kendrick[df_kendrick['album'] != 'Black Panther The Album Music From And Inspired By']
df_kendrick = df_kendrick[df_kendrick['album'] != 'DAMN. COLLECTORS EDITION.']
df_kendrick = df_kendrick.reset_index(drop=True)
df_kendrick.head()
Out[6]:
name album artist release_date length popularity danceability acousticness energy instrumentalness liveness loudness speechiness valence tempo time_signature
0 BLOOD. DAMN. Kendrick Lamar 2017-04-14 118066 61 0.357 0.14200 0.238 0.085900 0.5500 -16.780 0.265 0.494 156.907 4
1 DNA. DAMN. Kendrick Lamar 2017-04-14 185946 78 0.638 0.00454 0.523 0.000000 0.0842 -6.664 0.357 0.422 139.913 4
2 YAH. DAMN. Kendrick Lamar 2017-04-14 160293 64 0.670 0.57600 0.700 0.000005 0.2260 -7.893 0.196 0.648 69.986 4
3 ELEMENT. DAMN. Kendrick Lamar 2017-04-14 208733 71 0.748 0.20400 0.705 0.000000 0.2460 -4.547 0.485 0.483 189.891 4
4 FEEL. DAMN. Kendrick Lamar 2017-04-14 214826 64 0.746 0.13700 0.798 0.000000 0.1390 -8.382 0.349 0.553 109.968 4

Getting the data from Genius

I will be using the LyricsGenius package, which provides a simple interface to the song, artist, and lyrics data stored on Genius.

In [7]:
pip install git+https://github.com/johnwmillr/LyricsGenius.git
Collecting git+https://github.com/johnwmillr/LyricsGenius.git
  Cloning https://github.com/johnwmillr/LyricsGenius.git to /tmp/pip-req-build-_w3rl_eu
  Running command git clone -q https://github.com/johnwmillr/LyricsGenius.git /tmp/pip-req-build-_w3rl_eu
Requirement already satisfied (use --upgrade to upgrade): lyricsgenius==1.8.2 from git+https://github.com/johnwmillr/LyricsGenius.git in /opt/conda/lib/python3.7/site-packages
Requirement already satisfied: beautifulsoup4==4.6.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (4.6.0)
Requirement already satisfied: requests>=2.20.0 in /opt/conda/lib/python3.7/site-packages (from lyricsgenius==1.8.2) (2.23.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (1.25.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2019.11.28)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20.0->lyricsgenius==1.8.2) (2.9)
Building wheels for collected packages: lyricsgenius
  Building wheel for lyricsgenius (setup.py) ... done
  Created wheel for lyricsgenius: filename=lyricsgenius-1.8.2-py3-none-any.whl size=15038 sha256=8cd1776e3a526e58217a9324415b1e95849fa37880a194dc8f2d22f4fd7ab40f
  Stored in directory: /tmp/pip-ephem-wheel-cache-60w3__2c/wheels/12/d5/2b/6b771ebb067bceb8816ec5eef0dd0d36bf069b18f03ac8ca20
Successfully built lyricsgenius
Note: you may need to restart the kernel to use updated packages.

Like before, we set the credentials obtained from my Genius app.

In [8]:
import lyricsgenius

# Genius API
genius = lyricsgenius.Genius("BCf3jrIhyBKfxJSB7t8ELCDvW9ASizVNtK8VCaka8TuM3Igd3Tma5eDumwizSicV")
genius.remove_section_headers = True

My plan is to iterate through my dataframe, look up every song name, and build a list with the lyrics. I decided against applying a function to the dataframe since the Genius API is very unreliable (it likes to throw errors at random: at times it works, at times it just does not!). Although this approach is lengthier, it is much more reliable. Before doing any of this, I need to standardize some names between my dataframe and the Genius API (titles that include the word featuring are problematic). Please forgive the profanity in the following cell.

In [9]:
df_kendrick.loc[df_kendrick['name'] ==  "Swimming Pools (Drank) - Extended Version", 'name'] = "Swimming Pools (Drank)"
df_kendrick.loc[df_kendrick['name'] ==  "F*ck Your Ethnicity", 'name'] = "Fuck Your Ethnicity"
df_kendrick.loc[df_kendrick['name'] ==  "LOYALTY. FEAT. RIHANNA.", 'name'] = "LOYALTY."
df_kendrick.loc[df_kendrick['name'] ==  "LOVE. FEAT. ZACARI.", 'name'] = "LOVE."
df_kendrick.loc[df_kendrick['name'] ==  "XXX. FEAT. U2.", 'name'] = "XXX."

We are good to go! Let's see how the loop performs.

In [10]:
lyrics = []
for name in df_kendrick['name']:
    lyrics.append(genius.search_song(name, "Kendrick Lamar"))
Searching for "BLOOD." by Kendrick Lamar...
Done.
Searching for "DNA." by Kendrick Lamar...
Done.
Searching for "YAH." by Kendrick Lamar...
Done.
Searching for "ELEMENT." by Kendrick Lamar...
Done.
Searching for "FEEL." by Kendrick Lamar...
Done.
Searching for "LOYALTY." by Kendrick Lamar...
Done.
Searching for "PRIDE." by Kendrick Lamar...
Done.
Searching for "HUMBLE." by Kendrick Lamar...
Done.
Searching for "LUST." by Kendrick Lamar...
Done.
Searching for "LOVE." by Kendrick Lamar...
Done.
Searching for "XXX." by Kendrick Lamar...
Done.
Searching for "FEAR." by Kendrick Lamar...
Done.
Searching for "GOD." by Kendrick Lamar...
Done.
Searching for "DUCKWORTH." by Kendrick Lamar...
Done.
Searching for "untitled 01 | 08.19.2014." by Kendrick Lamar...
Done.
Searching for "untitled 02 | 06.23.2014." by Kendrick Lamar...
Done.
Searching for "untitled 03 | 05.28.2013." by Kendrick Lamar...
Done.
Searching for "untitled 04 | 08.14.2014." by Kendrick Lamar...
Done.
Searching for "untitled 05 | 09.21.2014." by Kendrick Lamar...
Done.
Searching for "untitled 06 | 06.30.2014." by Kendrick Lamar...
Done.
Searching for "untitled 07 | 2014 - 2016" by Kendrick Lamar...
Done.
Searching for "untitled 08 | 09.06.2014." by Kendrick Lamar...
Done.
Searching for "Wesley's Theory" by Kendrick Lamar...
Done.
Searching for "For Free? - Interlude" by Kendrick Lamar...
Done.
Searching for "King Kunta" by Kendrick Lamar...
Done.
Searching for "Institutionalized" by Kendrick Lamar...
Done.
Searching for "These Walls" by Kendrick Lamar...
Done.
Searching for "u" by Kendrick Lamar...
Done.
Searching for "Alright" by Kendrick Lamar...
Done.
Searching for "For Sale? - Interlude" by Kendrick Lamar...
Done.
Searching for "Momma" by Kendrick Lamar...
Done.
Searching for "Hood Politics" by Kendrick Lamar...
Done.
Searching for "How Much A Dollar Cost" by Kendrick Lamar...
Done.
Searching for "Complexion (A Zulu Love)" by Kendrick Lamar...
Done.
Searching for "The Blacker The Berry" by Kendrick Lamar...
Done.
Searching for "You Ain't Gotta Lie (Momma Said)" by Kendrick Lamar...
Done.
Searching for "i" by Kendrick Lamar...
Done.
Searching for "Mortal Man" by Kendrick Lamar...
Done.
Searching for "Sherane a.k.a Master Splinter’s Daughter" by Kendrick Lamar...
Done.
Searching for "Bitch, Don’t Kill My Vibe" by Kendrick Lamar...
Done.
Searching for "Backseat Freestyle" by Kendrick Lamar...
Done.
Searching for "The Art of Peer Pressure" by Kendrick Lamar...
Done.
Searching for "Money Trees" by Kendrick Lamar...
Done.
Searching for "Poetic Justice" by Kendrick Lamar...
Done.
Searching for "good kid" by Kendrick Lamar...
Done.
Searching for "m.A.A.d city" by Kendrick Lamar...
Done.
Searching for "Swimming Pools (Drank)" by Kendrick Lamar...
Done.
Searching for "Sing About Me, I'm Dying Of Thirst" by Kendrick Lamar...
Done.
Searching for "Real" by Kendrick Lamar...
Done.
Searching for "Compton" by Kendrick Lamar...
Done.
Searching for "The Recipe - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Black Boy Fly - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Now Or Never - Bonus Track" by Kendrick Lamar...
Done.
Searching for "The Recipe (Black Hippy Remix) - Bonus Track" by Kendrick Lamar...
Done.
Searching for "Bitch, Don’t Kill My Vibe - Remix" by Kendrick Lamar...
Done.
Searching for "Fuck Your Ethnicity" by Kendrick Lamar...
Done.
Searching for "Hol' Up" by Kendrick Lamar...
Done.
Searching for "A.D.H.D" by Kendrick Lamar...
Done.
Searching for "No Make-Up (Her Vice) (feat. Colin Munroe)" by Kendrick Lamar...
Done.
Searching for "Tammy's Song (Her Evils)" by Kendrick Lamar...
Done.
Searching for "Chapter Six" by Kendrick Lamar...
Done.
Searching for "Ronald Reagan Era" by Kendrick Lamar...
Done.
Searching for "Poe Mans Dreams (His Vice) (feat. GLC)" by Kendrick Lamar...
Done.
Searching for "Chapter Ten" by Kendrick Lamar...
Done.
Searching for "Keisha's Song (Her Pain) (feat. Ashtro Bot)" by Kendrick Lamar...
Done.
Searching for "Rigamortus" by Kendrick Lamar...
Done.
Searching for "Kush & Corinthians (feat. BJ The Chicago Kid)" by Kendrick Lamar...
Done.
Searching for "Blow My High (Members Only)" by Kendrick Lamar...
Done.
Searching for "Ab-Souls Outro (feat. Ab-Soul)" by Kendrick Lamar...
Done.
Searching for "HiiiPower" by Kendrick Lamar...
Done.
Searching for "Growing Apart (To Get Closer)" by Kendrick Lamar...
Done.
Searching for "Ignorance Is Bliss" by Kendrick Lamar...
Done.
Searching for "P&P 1.5" by Kendrick Lamar...
Done.
Searching for "Alien Girl (Today W/ Her)" by Kendrick Lamar...
Done.
Searching for "Opposites Attract (Tomorrow W/O Her)" by Kendrick Lamar...
Done.
Searching for "Michael Jordan" by Kendrick Lamar...
Done.
Searching for "R.O.T.C (Interlude)" by Kendrick Lamar...
Done.
Searching for "Barbed Wire" by Kendrick Lamar...
Done.
Searching for "Average Joe" by Kendrick Lamar...
Done.
Searching for "H.O.C" by Kendrick Lamar...
Done.
Searching for "Cut You Off (To Grow Closer)" by Kendrick Lamar...
Done.
Searching for "She Needs Me (Remix)" by Kendrick Lamar...
Done.

A quick inspection of the list of lyrics reveals that we got a few None values. Let's fix those:

In [11]:
while None in lyrics:
    none_index = [i for i in range(len(lyrics)) if lyrics[i] is None]
    for i in none_index:
        lyrics[i] = genius.search_song(df_kendrick['name'][i], "Kendrick Lamar")
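One caveat with the loop above: if a song simply cannot be found, it never terminates. A bounded retry is a safer variant. The helper below is a minimal sketch under my own assumptions (the name `call_with_retries`, five attempts, a one-second pause); it treats both exceptions and None results as failed attempts.

```python
import time


def call_with_retries(func, *args, attempts=5, delay=1.0, **kwargs):
    """Call func until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        try:
            result = func(*args, **kwargs)
            if result is not None:
                return result
        except Exception:
            # Broad on purpose: any error is treated as a failed attempt
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    # Every attempt failed; the caller decides how to handle the gap
    return None


# Usage with the objects from this notebook (commented out, since it needs a Genius token):
# for i in none_index:
#     lyrics[i] = call_with_retries(genius.search_song, df_kendrick['name'][i], "Kendrick Lamar")
```

Any entry that is still None afterwards can then be inspected by hand instead of blocking the notebook.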

We are ready to extract the lyrics and append them into the dataframe.

In [12]:
lyrics_text = []
for lyric in lyrics:
    lyrics_text.append(lyric.lyrics.lower())
    
df_kendrick['lyrics'] = lyrics_text
df_kendrick.head()
Out[12]:
name album artist release_date length popularity danceability acousticness energy instrumentalness liveness loudness speechiness valence tempo time_signature lyrics
0 BLOOD. DAMN. Kendrick Lamar 2017-04-14 118066 61 0.357 0.14200 0.238 0.085900 0.5500 -16.780 0.265 0.494 156.907 4 is it wickedness?\nis it weakness?\nyou decide...
1 DNA. DAMN. Kendrick Lamar 2017-04-14 185946 78 0.638 0.00454 0.523 0.000000 0.0842 -6.664 0.357 0.422 139.913 4 i got, i got, i got, i got—\nloyalty, got roya...
2 YAH. DAMN. Kendrick Lamar 2017-04-14 160293 64 0.670 0.57600 0.700 0.000005 0.2260 -7.893 0.196 0.648 69.986 4 new shit, new kung fu kenny\n\ni got so many t...
3 ELEMENT. DAMN. Kendrick Lamar 2017-04-14 208733 71 0.748 0.20400 0.705 0.000000 0.2460 -4.547 0.485 0.483 189.891 4 new kung fu kenny\nain't nobody prayin' for me...
4 FEEL. DAMN. Kendrick Lamar 2017-04-14 214826 64 0.746 0.13700 0.798 0.000000 0.1390 -8.382 0.349 0.553 109.968 4 ain't nobody prayin' for me\n(ain't nobody pra...

Quantifying Lyrical Sentiment.

To quantify lyrical sentiment I will use the NRC Emotion Lexicon. This dataset assigns different emotions or sentiments to English words. For example, the word abandon is mapped to the emotions of fear and sadness. My strategy is to compute, for each song, the proportion of emotion-bearing words that falls under each sentiment. I recognize that this is a very limited approach to lyric analysis, but I believe it is a good starting point given the current constraints.

Let's create a function that, given a song's lyrics and an emotion, computes the number of words in the song associated with that emotion.

In [13]:
nrc = pd.read_table("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt", header=None, names=['word','emotion','dummy'])
In [14]:
def count_emotion(song_lyrics, emotion):
    nrc_emotion = nrc[nrc['emotion'] == emotion]
    nrc_emotion = nrc_emotion[nrc_emotion['dummy'] == 1]
    nrc_emotion = nrc_emotion.reset_index(drop=True)
    sum_emotion = 0
    for word in nrc_emotion['word']:
        sum_emotion = sum_emotion + song_lyrics.count(word)    
    return sum_emotion
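One thing to keep in mind is that `str.count` counts substring occurrences, so a lexicon word such as "art" would also be counted inside "heart". Here is a minimal sketch of a whole-word alternative. The function name `count_emotion_words` and the tokenization rule (runs of letters and apostrophes) are my own assumptions, and `emotion_words` stands for the set of lexicon words tagged with a given emotion.

```python
import re
from collections import Counter


def count_emotion_words(song_lyrics, emotion_words):
    """Count whole-word occurrences of the given lexicon words in the lyrics."""
    # Assumed tokenization: lowercase runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", song_lyrics.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear in the lyrics
    return sum(counts[word] for word in emotion_words)


text = "my heart is art"
print(text.count("art"))                   # substring matches
print(count_emotion_words(text, {"art"}))  # whole-word matches
```

Counting tokens once per song also avoids scanning the full lyrics for every lexicon word.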

Next, we create a function that counts the words associated with each emotion in a song and converts those counts into proportions.

In [15]:
def emotion_prop(song_lyrics):
    emotion_proportions = []
    for emotion in nrc['emotion'].unique():
        emotion_proportions.append(count_emotion(song_lyrics, emotion))
    array = np.array(emotion_proportions)
    return array/array.sum()
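As a worked example of the normalization: counts of 2, 1, and 1 become proportions of 0.5, 0.25, and 0.25. Note that `array/array.sum()` produces NaN values when no lexicon word appears in the lyrics at all (for instance, in an instrumental track). The sketch below shows the same computation in plain Python with a guard for that case; returning zeros there is my own design choice, not something the original function does.

```python
def to_proportions(counts):
    """Convert per-emotion counts into proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        # No emotion words found: return zeros instead of dividing by zero
        return [0.0] * len(counts)
    return [count / total for count in counts]


print(to_proportions([2, 1, 1]))
print(to_proportions([0, 0, 0]))
```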

Finally, we append these proportions for each song in the dataframe:

In [16]:
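emotions = nrc['emotion'].unique()
# compute the proportions once per song, then spread them into one column per emotion
proportions = np.vstack(df_kendrick['lyrics'].apply(emotion_prop).tolist())
for i, emotion in enumerate(emotions):
    df_kendrick[emotion + "_index"] = proportions[:, i]
df_kendrick.head()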

As desired. We are ready to begin visualizing Kendrick's discography!

Visualization

First I wanted to see which lyrical and musical features dominate, on average, in each Kendrick album. To do so, I decided to produce an interactive radar plot with all albums overlaid on top of each other. This plot might look messy at first, but one can select (or unselect) each album by clicking on its legend entry. In this way, we get a clear yet compact visualization.

In [17]:
lyrical_variables = nrc['emotion'].unique()+"_index"
musical_variables =  ["danceability", "acousticness", "energy", "instrumentalness", "liveness","speechiness", "valence"]
In [18]:
def radar_discography_plot(df, x, musical, export=False):
    # Radar plot of the per-album mean of each feature; x sets the horizontal position of the title
    artist = df['artist'][0]
    df = df[::-1]
    # average only the numeric columns (name, artist, lyrics, etc. are skipped)
    df = df.groupby('album', as_index=False, sort=False).mean(numeric_only=True)
    df = df.drop(['length','popularity','loudness','tempo','time_signature'], axis = 1)
    df_long = pd.melt(df, id_vars='album', value_vars=df.columns.values[1:])
    # musical features span [0, 1]; the emotion proportions are much smaller
    range_r = [0, 1] if musical else [0, 0.3]
    kind = "Musical" if musical else "Lyrical"
    fig = px.line_polar(df_long, r="value", theta="variable", color="album", line_close=True, range_r=range_r,
                color_discrete_sequence=px.colors.qualitative.Bold)
    fig.update_layout(title = {'text': kind + " Features of " + artist + "'s Discography",'y':0.98,'x':x,'xanchor': 'center','yanchor': 'top'})
    fig.show()
    if export:
        fig.write_html(artist+".html")

Let's visualize the musical features of each of his albums first, followed by the lyrical ones.

In [19]:
radar_discography_plot(df_kendrick.drop(lyrical_variables,axis=1).copy(),0.45, musical = True)
radar_discography_plot(df_kendrick.drop(musical_variables,axis=1).copy(),0.45, musical = False)