Spotify Data Exploration: the Popularity Feature

Intro:

After retrieving some data from the Spotify API (for more information, check out this notebook), it's time to get some insights. In this notebook, I use data collected in August and September 2018 to identify the most popular tracks and artists on Spotify using the 'popularity' feature.

About the Popularity Feature:

From the official Spotify docs:

"The popularity of the track. The value will be between 0, for least popular, and 100 for most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. Popularity is based mainly on the total number of playbacks. Duplicate tracks, such as both in a single and in an album, are popularity rated differently. Note: This value is not updated in real-time and may therefore lag behind in actual popularity."

Goal of this Notebook:

The goal is to use the previously retrieved data to gain insights from the popularity feature, such as the most popular tracks and the most popular artists, by analyzing and visualizing the data with the Python libraries pandas, NumPy and Matplotlib.

In [1]:
# import libraries
import glob, os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# get all csv files into one variable
path = 'Datasets/Summer2018'
all_files = glob.glob(os.path.join(path, "*.csv"))

# create lists of columns to be used when reading/merging the csv's
columns = ['artist_name','track_id', 'track_name', 'popularity']
merge_columns = ['artist_name','track_id', 'track_name']

# create dataframes by reading the csv's in all_files
df_from_each_file = (pd.read_csv(f, usecols=columns) for f in all_files)

# create empty dataframe with the defined column structure
df = pd.DataFrame(columns=columns)

# loop over the dataframes and merge them into one
# outer join keeps every track and adds one popularity column per file
# each file name ends in MMDD.csv, so file[-8:-4] supplies the column suffix (0920, 0830 etc.)
for df_, file in zip(df_from_each_file, all_files):
    df = df.merge(df_, how='outer', on=merge_columns, suffixes=('', file[-8:-4]))

print('Shape: ', df.shape)
df.head()
Shape:  (13257, 7)
Out[1]:
popularity artist_name popularity0920 track_id track_name popularity0830 popularity0807
0 NaN Travis Scott 96.0 2xLMifQCjDGFmkHkpNLD9h SICKO MODE 94.0 86.0
1 NaN 6ix9ine 96.0 2E124GmJRnBJuXbTb4cPUB FEFE (feat. Nicki Minaj & Murda Beatz) 95.0 94.0
2 NaN Juice WRLD 96.0 0s3nnoMeVWz3989MkNQiRf Lucid Dreams 95.0 95.0
3 NaN Drake 100.0 2G7V7zsVDxg1yRsu7Ew9RJ In My Feelings 100.0 100.0
4 NaN XXXTENTACION 95.0 3ee8Jmje8o58CHK66QrVC2 SAD! 95.0 98.0
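
The merge above relies on suffixes to tell the per-file popularity columns apart, which also leaves an all-NaN 'popularity' column from the empty starting frame. As a rough sketch of an alternative (not the approach used in this notebook), each snapshot's column can be renamed before merging. The two small frames and their values below are made up for illustration; only the column names and the date suffixes follow the real data.

```python
import pandas as pd

merge_columns = ['artist_name', 'track_id', 'track_name']

# hypothetical snapshots, keyed by the date suffix taken from the file name
snapshots = {
    '0807': pd.DataFrame({
        'artist_name': ['A', 'B'],
        'track_id': ['id1', 'id2'],
        'track_name': ['Song 1', 'Song 2'],
        'popularity': [90, 80],
    }),
    '0830': pd.DataFrame({
        'artist_name': ['A', 'C'],
        'track_id': ['id1', 'id3'],
        'track_name': ['Song 1', 'Song 3'],
        'popularity': [92, 70],
    }),
}

# rename 'popularity' per snapshot, then outer-join on the track keys
merged = None
for suffix, snap in snapshots.items():
    snap = snap.rename(columns={'popularity': 'popularity' + suffix})
    merged = snap if merged is None else merged.merge(snap, how='outer', on=merge_columns)

print(merged)
```

A track that is missing from one snapshot gets NaN in that snapshot's column, which is exactly what the outer join in the cell above produces as well.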

Since I merged the three files on artist name, track ID and track name, there shouldn't be many duplicates.

However, it is still worth doing a quick drop_duplicates here.

In [2]:
# drop duplicate tracks
df.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
print('Shape after dropping: ', df.shape)
Shape after dropping:  (12933, 7)
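
One thing to keep in mind: drop_duplicates keeps the first row it finds for each artist and track name. Since Spotify rates duplicate tracks (for example a single and its album version) differently, the surviving row is not necessarily the most popular copy. The sketch below uses invented rows to show the default behaviour next to one possible alternative, sorting by popularity first. This is an illustration, not what the cell above does.

```python
import pandas as pd

# hypothetical data: the same song released as a single and on an album
tracks = pd.DataFrame({
    'artist_name': ['A', 'A', 'B'],
    'track_id': ['single_id', 'album_id', 'other_id'],
    'track_name': ['Song 1', 'Song 1', 'Song 2'],
    'popularity0920': [70.0, 95.0, 80.0],
})

# default: the first row per (artist, track name) survives
kept_first = tracks.drop_duplicates(subset=['artist_name', 'track_name'])

# alternative: sort by popularity first so the highest-rated copy survives
kept_best = (tracks.sort_values('popularity0920', ascending=False)
                   .drop_duplicates(subset=['artist_name', 'track_name']))

print(kept_first)
print(kept_best)
```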

1. Top 50 Most Popular Tracks

In [3]:
# sum individual popularity scores
df['popularity'] = df[['popularity0920', 'popularity0830', 'popularity0807']].sum(axis=1)

# also calculate the mean popularity score
df['popularity_mean'] = df[['popularity0920', 'popularity0830', 'popularity0807']].mean(axis=1)

# create a new dataframe df_top with the 100 most popular tracks, ordered by summed popularity
df_top = df.sort_values('popularity', ascending=False).head(100)

# show the first 50 results
df_top[['artist_name', 'track_name', 'popularity', 'popularity_mean']].head(50)
Out[3]:
artist_name track_name popularity popularity_mean
3 Drake In My Feelings 300.0 100.000000
4 XXXTENTACION SAD! 288.0 96.000000
12 Cardi B I Like It 288.0 96.000000
6 Tyga Taste (feat. Offset) 287.0 95.666667
10 Post Malone Better Now 287.0 95.666667
80 Clean Bandit Solo (feat. Demi Lovato) 287.0 95.666667
2 Juice WRLD Lucid Dreams 286.0 95.333333
72 Calvin Harris One Kiss (with Dua Lipa) 285.0 95.000000
1 6ix9ine FEFE (feat. Nicki Minaj & Murda Beatz) 285.0 95.000000
9 benny blanco Eastside (with Halsey & Khalid) 284.0 94.666667
33 Tiësto Jackie Chan 284.0 94.666667
20 Maroon 5 Girls Like You (feat. Cardi B) 284.0 94.666667
55 Jonas Blue Rise 283.0 94.333333
14 DJ Khaled No Brainer 283.0 94.333333
40 Nio Garcia Te Boté - Remix 282.0 94.000000
7 XXXTENTACION Moonlight 281.0 93.666667
58 Dynoro In My Mind 280.0 93.333333
144 Becky G Sin Pijama 278.0 92.666667
0 Travis Scott SICKO MODE 276.0 92.000000
87 Martin Garrix Ocean (feat. Khalid) 275.0 91.666667
21 XXXTENTACION changes 275.0 91.666667
163 Ozuna Vaina Loca 274.0 91.333333
8 Drake Nonstop 274.0 91.333333
45 5 Seconds of Summer Youngblood 274.0 91.333333
18 Post Malone Psycho (feat. Ty Dolla $ign) 273.0 91.000000
25 Post Malone rockstar (feat. 21 Savage) 273.0 91.000000
31 A$AP Rocky Praise The Lord (Da Shine) 271.0 90.333333
240 Nicky Jam X 271.0 90.333333
134 Marshmello FRIENDS 270.0 90.000000
11 Lil Baby Yes Indeed 270.0 90.000000
23 Drake God's Plan 269.0 89.666667
184 Shakira Clandestino 269.0 89.666667
13 Travis Scott STARGAZING 269.0 89.666667
397 David Guetta Flames 268.0 89.333333
263 Reik Me Niego 268.0 89.333333
177 George Ezra Shotgun 268.0 89.333333
66 Drake Don’t Matter To Me (with Michael Jackson) 268.0 89.333333
37 Zedd Happy Now 268.0 89.333333
89 Zedd The Middle 267.0 89.000000
35 Dean Lewis Be Alright 267.0 89.000000
107 Camila Cabello Havana 267.0 89.000000
43 Billie Eilish lovely (with Khalid) 266.0 88.666667
28 BlocBoy JB Look Alive (feat. Drake) 266.0 88.666667
170 Nicky Jam X - Remix 266.0 88.666667
393 Sofia Reyes 1, 2, 3 (feat. Jason Derulo & De La Ghetto) 266.0 88.666667
130 Selena Gomez Back To You - From 13 Reasons Why – Season 2 S... 266.0 88.666667
334 Zion & Lennox La Player (Bandolera) 265.0 88.333333
42 Khalid Love Lies (with Normani) 264.0 88.000000
30 Imagine Dragons Natural 264.0 88.000000
44 Panic! At The Disco High Hopes 263.0 87.666667
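
A note on the two scores: pandas skips NaN values by default in both sum and mean. For the sum, a track that is missing from one snapshot is effectively scored 0 for that date, while the mean is taken over the available snapshots only. Because the ranking above sorts by the sum, tracks that appeared later in the summer are pushed down. The small example below, with made-up scores, shows how the two rankings can disagree.

```python
import numpy as np
import pandas as pd

# hypothetical scores: 'new_track' is missing from the first snapshot
scores = pd.DataFrame(
    {
        'popularity0807': [np.nan, 80.0],
        'popularity0830': [95.0, 80.0],
        'popularity0920': [95.0, 80.0],
    },
    index=['new_track', 'steady_track'],
)

total = scores.sum(axis=1)   # NaN is skipped, so it contributes nothing to the sum
mean = scores.mean(axis=1)   # NaN is skipped, so the mean covers two snapshots only

print(pd.DataFrame({'sum': total, 'mean': mean}))
```

Here 'steady_track' wins on the sum (240 vs 190), while 'new_track' wins on the mean (95 vs 80).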

2. Top Artists by Popularity

Note: the Spotify API offers a special popularity score on artist-level as well. That score is not used here.

Instead, I have used only the popularity scores of their individual tracks.

In [4]:
# show top 20 artists by number of tracks in top 100
df_top[['artist_name','track_name']].groupby('artist_name').count().sort_values('track_name', ascending=False).head(20)
Out[4]:
track_name
artist_name
Drake 5
XXXTENTACION 5
Travis Scott 5
Post Malone 5
Juice WRLD 3
Khalid 2
David Guetta 2
Daddy Yankee 2
Ozuna 2
Nicky Jam 2
Childish Gambino 2
Selena Gomez 2
Billie Eilish 2
Tyga 2
Zedd 2
Marshmello 1
Maroon 5 1
Martin Garrix 1
Migos 1
Maluma 1

In [5]:
# show top 20 artists by total popularity of their tracks in top 100
df_top[['artist_name','popularity']].groupby('artist_name').sum().sort_values('popularity', ascending=False).head(20)
Out[5]:
popularity
artist_name
Drake 1364.0
XXXTENTACION 1362.0
Post Malone 1343.0
Travis Scott 1303.0
Juice WRLD 797.0
Tyga 544.0
Nicky Jam 537.0
Ozuna 535.0
Zedd 535.0
David Guetta 525.0
Khalid 524.0
Selena Gomez 520.0
Billie Eilish 518.0
Daddy Yankee 515.0
Childish Gambino 513.0
Cardi B 288.0
Clean Bandit 287.0
6ix9ine 285.0
Calvin Harris 285.0
benny blanco 284.0
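
The two artist rankings above can also be computed in a single pass. The following is a minimal sketch using named aggregation (available in pandas 0.25 and later) on a small invented frame; the output column names n_tracks, total_popularity and mean_popularity are my own choice for the example and do not appear elsewhere in this notebook.

```python
import pandas as pd

# hypothetical stand-in for df_top
top = pd.DataFrame({
    'artist_name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'track_name': ['a1', 'a2', 'b1', 'c1', 'c2', 'c3'],
    'popularity': [290.0, 280.0, 300.0, 270.0, 265.0, 260.0],
})

# one row per artist: track count, summed popularity and mean popularity
artist_summary = (
    top.groupby('artist_name')
       .agg(n_tracks=('track_name', 'count'),
            total_popularity=('popularity', 'sum'),
            mean_popularity=('popularity', 'mean'))
       .sort_values('total_popularity', ascending=False)
)

print(artist_summary)
```

Putting the mean next to the total makes it easier to see that the total mostly rewards artists with many tracks in the top 100, as in the output above.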

3. Visualizing Popularity

For this visualization I borrowed the code from another project of mine - Twitter 10k (plot number 5).

In [6]:
# create a new transposed dataframe where the track names are the columns and individual popularities the rows
df_top10_pop = df_top[['track_name','popularity0807','popularity0830','popularity0920']].set_index('track_name').head(10).T

# set the figure size
plt.figure(figsize=(12,18))
 
# create a color palette
palette = plt.get_cmap('Set1')

# multiple line plot of the top10 track popularities
num=0
for track in df_top10_pop.columns:
    num+=1
 
    # find the right spot on the plot
    plt.subplot(10,1, num)
    
    # plot the individual popularities
    df_top10_pop.loc[['popularity0807', 'popularity0830', 'popularity0920'],track].plot(marker='', color=palette(num), linewidth=2.5)
    
    # same limits for every subplot
    plt.ylim(90,100)
    
    # label the three snapshots on the x-axis (they sit at positions 0, 1 and 2)
    plt.xticks([0, 1, 2], ['7th August', '30th August', '20th September'])

    # show the x tick labels only on the last subplot
    if num < 10:
        plt.tick_params(labelbottom=False)
        
    # add title
    plt.title(track, loc='left', fontsize=10, fontweight=0, color=palette(num))
    
# add general title
plt.suptitle("Popularity of Top 10 Tracks during Summer 2018", fontsize=13, fontweight=0, color='black', style='italic');
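
For comparison, here is a rough sketch of the same small-multiples idea using plt.subplots with shared axes, which hides the x tick labels on all but the bottom panel automatically. The three tracks and their scores are invented for illustration, and this is only one possible way to structure the plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical popularity scores, one column per track, one row per snapshot
pop = pd.DataFrame(
    {'Track A': [100, 100, 100], 'Track B': [98, 95, 95], 'Track C': [94, 95, 96]},
    index=['7th August', '30th August', '20th September'],
)

# one panel per track, sharing both axes
fig, axes = plt.subplots(len(pop.columns), 1, sharex=True, sharey=True, figsize=(12, 6))
palette = plt.get_cmap('Set1')

for i, (ax, track) in enumerate(zip(axes, pop.columns)):
    ax.plot(range(len(pop.index)), pop[track], color=palette(i), linewidth=2.5)
    ax.set_title(track, loc='left', fontsize=10, color=palette(i))
    ax.set_ylim(90, 100)

# ticks are shared, so setting them on the bottom panel is enough
axes[-1].set_xticks(range(len(pop.index)))
axes[-1].set_xticklabels(list(pop.index))

fig.suptitle('Popularity of Top Tracks (illustrative data)', fontsize=13, style='italic')
```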