This notebook contains some code snippets I used in writing the following blog post: https://dev.to/ericbonfadini/finding-my-new-favorite-song-on-spotify-4lgc
While I'm developing (or just while I'm commuting to work) I usually love to hear some rock music.
I created some playlists on Spotify, but lately I'm stick to the same playlist, containing my favorite "Indie Rock" songs. This playlist is made up of more or less 45 songs I discovered through the years in several ways.
Since I was starting to get bored about always listening to the same songs, last weekend I decided to analyze my playlist using Spotify APIs in order to discover insights and hopefully to find some new tunes I could add.
Here's what I did in more or less 300 lines of Python 3 code (boilerplate included).
For my analysis I set up a Python 3 virtual environment with the following libraries:
In order to access the Spotify APIs I registered my app and then I provided the spotipy library with the client_id, the client_secret and a redirect url.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spotipy
from matplotlib import style
from spotipy import util
# Improve Pandas display settings
pd.set_option("display.width", 600)
pd.set_option("display.max_columns", 50)
# Change Seaborn default settings
sns.set_context('talk')
style.use('ggplot')
# Spotify API limit
API_LIMIT = 50
After a first API call to get my playlist id, I got all the tracks of my playlist with another API call along with some basic information like: song id, song name, artist id, artist name, album name, song popularity.
With another API call I got some extra information about the artists in my playlist, like genres and artist_popularity.
Finally with another API call I got some insightful information about my tracks:
The results of all these calls have been put inside Pandas dataframes in order to simplify the data analysis and then merged in a single dataframe using artist IDs and track IDs. Some values (like song/artist popularity and tempo) have been normalized.
username = "eric.bonfadini"
playlist_name = "Indie Rock"
# Get token
scope = 'playlist-modify-public'
spotify_token = util.prompt_for_user_token(username, scope)
# Create client
sp = spotipy.Spotify(auth=spotify_token)
# Get playlist ID
playlists_results = sp.user_playlists(username)
playlist_ids = [playlist['id'] for playlist in playlists_results['items'] if playlist['name'] == playlist_name]
if not playlist_ids:
raise Exception("Cannot find playlist named: {}".format(playlist_name))
else:
print(playlist_ids)
# Get tracks
tracks_results = sp.user_playlist(username, playlist_ids[0])
df_tracks = pd.DataFrame([[t["track"]["id"], t["track"]["name"], t["track"]["artists"][0]["id"],
t["track"]["artists"][0]["name"], t["track"]["album"]["name"], t["track"]["popularity"]]
for t in tracks_results['tracks']['items']],
columns=["id", "song_name", "artist_id", "artist_name", "album_name", "popularity"])
# Normalize popularity
df_tracks["popularity_norm"] = df_tracks["popularity"] / 100.
df_tracks.head()
def _get_artists_df(sp, artist_ids):
"""
This is an helper method to get artist's information with pagination from artist ids.
It returns a Pandas dataframe
"""
artist_list = []
i = 0
while artist_ids:
print("Call #{} for artists".format(i + 1))
artists_results = sp.artists(artist_ids[:API_LIMIT])
artist_list += [[t["id"], t["genres"], t["popularity"]] for t in artists_results["artists"]]
artist_ids = artist_ids[API_LIMIT:]
i += 1
df_artists = pd.DataFrame(artist_list, columns=["artist_id", "artist_genres", "artist_popularity"])
df_artists["artist_popularity_norm"] = df_artists["artist_popularity"] / 100.
return df_artists
artist_ids = df_tracks["artist_id"].unique().tolist()
df_artists = _get_artists_df(sp, artist_ids)
df_artists.head()
def _get_features_df(sp, track_ids):
"""
This is an helper method to get track's features with pagination from track ids.
It returns a Pandas dataframe
"""
feature_list = []
i = 0
while track_ids:
print("Call #{} for audio features".format(i + 1))
features_results = sp.audio_features(track_ids[:API_LIMIT])
feature_list += features_results
track_ids = track_ids[API_LIMIT:]
i += 1
df_features = pd.DataFrame(feature_list)[["id", "analysis_url", "duration_ms", "acousticness", "danceability",
"energy", "instrumentalness", "liveness", "loudness", "valence",
"speechiness", "key", "mode", "tempo", "time_signature"]]
# tempo is in range 24-200 ==> 0-176, normalize it
df_features["tempo_norm"] = (df_features["tempo"] - 24) / 176.
return df_features
track_ids = df_tracks["id"].unique().tolist()
df_features = _get_features_df(sp, track_ids)
df_features.head()
# Create a df for current playlist merging the dataframes
df_cur = df_features.merge(df_tracks, on="id")
df_cur = df_cur.merge(df_artists, on="artist_id")
# Create a new column with full name of the song
df_cur["full_name"] = df_cur["artist_name"] + " -- " + df_cur["song_name"]
# Sort by song popularity
df_cur.sort_values("popularity", inplace=True, ascending=False)
df_cur.head()
After ensuring that the artists in my playlist all contain "Indie Rock" as genre, I checked my playlist using shape, info and describe of the full dataframe
print(df_cur.info())
df_cur.describe()
# Convert time_signature and key to category
df_cur["time_signature"] = df_cur["time_signature"].astype(pd.api.types.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
df_cur["key"] = df_cur["key"].astype(pd.api.types.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
def _distplot(df, key, label, x_limits):
"""
This is an helper method to plot distribution charts
"""
ax = sns.distplot(df[[key]], bins=30, label=label)
if x_limits is not None:
ax.set_xlim(*x_limits)
plt.title(key)
plt.legend()
plt.show()
x_limits = {"duration_ms": None, "loudness": (-60, 0), "tempo": (24, 200), "popularity": (0, 100),
"artist_popularity": (0, 100)}
for key in ["duration_ms", "acousticness", "danceability", "energy", "instrumentalness", "liveness",
"loudness", "valence", "speechiness", "tempo", "popularity", "artist_popularity"]:
_distplot(df_cur, key, label="My Playlist", x_limits=x_limits.get(key, (0, 1)))
def _countplot(df, key, label):
"""
This is an helper method to plot count charts
"""
ax = sns.countplot(data=df, x=key, palette="tab20")
ax.set_title(label)
plt.show()
for key in ["key", "time_signature", "mode"]:
_countplot(df_cur, key, label="My Playlist")
ax = sns.boxplot(data=df_cur[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
"valence", "speechiness", "artist_popularity_norm", "popularity_norm", "tempo_norm"]])
ax.set_title("My Playlist")
plt.show()
All these graphs show that I like songs with low acousticness/instrumentalness/speechiness, high energy/loudness/tempo, high artist popularity and duration of more or less 200 seconds. Valence and song popularity span on a wide range, meaning that I have in my playlist both well-known and unknown songs, and both positive and negative ones.
But how my playlist compare against the Indie Rock genre?
I used some calls to the search API with 'genre:"Indie Rock"' as a keyword and 'type=tracks' to get a sample of the Indie Rock genre (5000 songs in total). This API offers also some nice keywords like 'tag:hipster' (to get only albums with the lowest 10% popularity) or 'year:1980-2020' (to get only tracks released in a specific year range).
number_of_tracks = 5000
genre = "indie rock"
search_runs = int(number_of_tracks / API_LIMIT)
search_list = []
for i in range(search_runs):
print("Call #{} for tracks".format(i+1))
search_results = sp.search('genre:"{}"'.format(genre), type="track", limit=API_LIMIT, offset=API_LIMIT*i)
search_list += [[t["id"], t["name"], t["artists"][0]["id"], t["artists"][0]["name"],
t["album"]["name"], t["popularity"]]
for t in search_results['tracks']['items']]
df_search = pd.DataFrame(search_list,
columns=["id", "song_name", "artist_id", "artist_name", "album_name", "popularity"])
df_search["popularity_norm"] = df_search["popularity"] / 100.
df_search.head()
track_ids = df_search["id"].unique().tolist()
df_features = _get_features_df(sp, track_ids)
df_features.head()
artist_ids = df_search["artist_id"].unique().tolist()
df_artists = _get_artists_df(sp, artist_ids)
df_artists.head()
df_sample = df_features.merge(df_search, on="id")
df_sample = df_sample.merge(df_artists, on="artist_id")
df_sample["full_name"] = df_sample["artist_name"] + " -- " + df_sample["song_name"]
df_sample.sort_values("popularity", inplace=True, ascending=False)
df_sample.head()
# Convert time_signature and key to category
df_sample["time_signature"] = df_sample["time_signature"].astype(pd.api.types.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
df_sample["key"] = df_sample["key"].astype(pd.api.types.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
print(df_sample.info())
df_sample.describe()
Then I repeated the same analysis on both my current playlist and the Indie Rock sample and I got the following charts:
def _distplot2(df, df_other, key, labels, x_limits):
ax = sns.distplot(df[[key]], bins=30, label=labels[0])
if x_limits is not None:
ax.set_xlim(*x_limits)
ax = sns.distplot(df_other[[key]], bins=30, label=labels[1])
if x_limits is not None:
ax.set_xlim(*x_limits)
plt.title(key)
plt.legend()
plt.show()
def _countplot2(df, df_other, key, labels):
fig, ax = plt.subplots(1, 2)
sns.countplot(data=df, x=key, ax=ax[0], palette="tab20")
ax[0].set_title(labels[0])
sns.countplot(data=df_other, x=key, ax=ax[1], palette="tab20")
ax[1].set_title(labels[1])
plt.show()
for key in ["duration_ms", "acousticness", "danceability", "energy", "instrumentalness", "liveness",
"loudness", "valence", "speechiness", "tempo", "popularity", "artist_popularity"]:
_distplot2(df_cur, df_sample, key,
labels=["My Playlist", "5000 Indie rock songs"],
x_limits=x_limits.get(key, (0, 1)))
for key in ["key", "time_signature", "mode"]:
_countplot2(df_cur, df_sample, key, labels=["My Playlist", "5000 Indie rock songs"])
fig, ax = plt.subplots(2, 1)
sns.boxplot(data=df_cur[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
"valence", "speechiness", "artist_popularity_norm", "popularity_norm",
"tempo_norm"]], ax=ax[0])
ax[0].set_title("My Playlist")
sns.boxplot(data=df_sample[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
"valence", "speechiness", "artist_popularity_norm", "popularity_norm",
"tempo_norm"]], ax=ax[1])
ax[1].set_title("5000 Indie rock songs")
plt.show()
The graphs show that my playlist differs from the 5000 Indie Rock songs because:
The boxplot confirms the same insights and I agree with the outcome of this analysis.
Using these insights, I applied some filters to the 5000 songs dataframe in order to keep only tracks I potentially like; for each step I logged the dropped songs to double check the filter behavior.
The first filter I applied was removing songs I already had in my original playlist, obviously.
The other filters were:
def _apply_condition(df, condition, label):
before = len(df)
dropped_songs = df[~condition]["full_name"].head().tolist()
df = df[condition]
print("\ncondition [{}]: {}-{}={}".format(label, before, before - len(df), len(df)))
print("first 10 dropped songs: {}".format(dropped_songs))
return df
df_new = df_sample.drop_duplicates(["full_name"], keep="first")
df_new = _apply_condition(df_new,
condition=~(df_new["full_name"]).isin((df_cur["full_name"]).tolist()),
label="name")
df_new = _apply_condition(df_new,
condition=(df_new["acousticness"] < 0.1),
label="acousticness")
df_new = _apply_condition(df_new,
condition=(df_new["energy"] > 0.75),
label="energy")
df_new = _apply_condition(df_new,
condition=(df_new["loudness"] > -7),
label="loudness")
df_new = _apply_condition(df_new,
condition=(df_new["valence"].between(0.3, 0.9)),
label="valence")
df_new = _apply_condition(df_new,
condition=(df_new["tempo"] > 120),
label="tempo")
df_new = _apply_condition(df_new,
condition=(df_new["key"].isin([9, 0, 1, 6])),
label="key")
df_new = _apply_condition(df_new,
condition=(df_new["duration_ms"].between(*df_cur["duration_ms"].quantile([0.1, 0.9]))),
label="duraton_ms")
df_new.head()
In the end I got a dataframe with 220 tracks and I created a new playlist using an API call
playlist_name_new = "Indie Rock Trial"
playlists = sp.user_playlists(username)
playlist_ids = [playlist['id'] for playlist in playlists['items'] if playlist['name'] == playlist_name_new]
if not playlist_ids:
playlists = sp.user_playlist_create(username, playlist_name_new)
playlist_id = playlists["id"]
else:
playlist_id = playlist_ids[0]
results = sp.user_playlist(username, playlist_id)
track_ids = [t["track"]["id"] for t in results["tracks"]["items"]]
results = sp.user_playlist_remove_all_occurrences_of_tracks(username, playlist_id, track_ids)
print(results)
track_ids = df_new["id"].unique().tolist()
while track_ids:
results = sp.user_playlist_add_tracks(username, playlist_id, track_ids[:API_LIMIT])
print(results)
track_ids = track_ids[API_LIMIT:]
After a few days listening to the new playlist, I'm quite happy with the results and I'm already promoting some tracks to my original playlist.
This "recommendation" method is really simple and probably works well only in this specific use case (i.e. it's a well definable subset of a specific genre). The standard recommendation method of Spotify is obviously much better because, apart from audio analysis, it uses also a mix of Collaborative Filtering models (analyzing your behavior and others’ behavior) and Natural Language Processing (NLP) models (analyzing text of the songs).
Next steps: