Spotify Data Exploration: the Popularity Feature

Intro:

After retrieving some data from the Spotify API (for more info about that, check out this notebook), it's time to get some insights. In this notebook, I will use data collected during August and September 2018 to identify the most popular tracks and artists on Spotify using the 'popularity' feature.

About the Popularity Feature:

From the official Spotify docs:

"The popularity of the track. The value will be between 0, for least popular, and 100 for most popular.

The popularity of a track is a value between 0 and 100, with 100 being the most popular. Popularity is based mainly on the total number of playbacks. Duplicate tracks, such as both in a single and in an album, are popularity rated differently. Note: This value is not updated in real-time and may therefore lag behind in actual popularity."

Goal of this Notebook:

The goal is to use the previously retrieved data to gain insights from the popularity feature, such as the most popular tracks and artists, by analyzing and visualizing the data with the Python libraries pandas, NumPy and Matplotlib.

In [1]:

# import libraries
import glob, os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# get all csv files into one variable
path = 'Datasets/Summer2018'
all_files = glob.glob(os.path.join(path, "*.csv"))

# create lists of columns to be used when reading/merging the csv's
columns = ['artist_name','track_id', 'track_name', 'popularity']
merge_columns = ['artist_name','track_id', 'track_name']

# create dataframes by reading the csv's in all_files
df_from_each_file = (pd.read_csv(f, usecols=columns) for f in all_files)

# create empty dataframe with the defined column structure
df = pd.DataFrame(columns=columns)

# loop over dataframes and merge into one dataframe
# the outer join keeps tracks that are missing from some files; the suffix keeps each file's popularity column
for df_, file_path in zip(df_from_each_file, all_files): # the file names provide the column suffix (0920, 0830 etc.)
    df = df.merge(df_, how='outer', on=merge_columns, suffixes=('', str(file_path)[-8:-4]))

print('Shape: ', df.shape)
df.head()

Shape:  (13257, 7)

Out[1]:

	popularity	artist_name	popularity0920	track_id	track_name	popularity0830	popularity0807
0	NaN	Travis Scott	96.0	2xLMifQCjDGFmkHkpNLD9h	SICKO MODE	94.0	86.0
1	NaN	6ix9ine	96.0	2E124GmJRnBJuXbTb4cPUB	FEFE (feat. Nicki Minaj & Murda Beatz)	95.0	94.0
2	NaN	Juice WRLD	96.0	0s3nnoMeVWz3989MkNQiRf	Lucid Dreams	95.0	95.0
3	NaN	Drake	100.0	2G7V7zsVDxg1yRsu7Ew9RJ	In My Feelings	100.0	100.0
4	NaN	XXXTENTACION	95.0	3ee8Jmje8o58CHK66QrVC2	SAD!	95.0	98.0
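To make the merge step easier to follow, here is a minimal sketch on made-up data (the two small frames, their IDs and their values are hypothetical, not taken from the dataset). Instead of relying on suffixes, it renames the popularity column before merging, which is one possible alternative, and it shows that an outer join keeps tracks that are missing from one snapshot and fills the gap with NaN.

```python
import pandas as pd

keys = ['artist_name', 'track_id', 'track_name']

# hypothetical snapshots: track id2 is only in the first, id3 only in the second
aug = pd.DataFrame({'artist_name': ['A', 'B'],
                    'track_id': ['id1', 'id2'],
                    'track_name': ['x', 'y'],
                    'popularity': [90, 80]})
sep = pd.DataFrame({'artist_name': ['A', 'C'],
                    'track_id': ['id1', 'id3'],
                    'track_name': ['x', 'z'],
                    'popularity': [95, 70]})

# give each snapshot its own popularity column, then outer-join on the keys
merged = aug.rename(columns={'popularity': 'popularity0807'}).merge(
    sep.rename(columns={'popularity': 'popularity0920'}),
    how='outer', on=keys)

print(merged)
```

Tracks that appear in only one file get NaN in the other popularity column, which is worth keeping in mind whenever the columns are summed or averaged.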

Since I merged the three files on artist name, track ID and track name, there shouldn't be a lot of duplicates.

However, it is still worth doing a quick drop_duplicates here.

In [2]:

# drop duplicate tracks
df.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
print('Shape after dropping: ', df.shape)

Shape after dropping:  (12933, 7)
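The step above removed 324 rows (13257 minus 12933). Before dropping, it can help to look at what is being removed. The sketch below uses made-up rows (all values are hypothetical); it assumes that one possible source of duplicates is the same song listed under different track IDs, in line with the docs quote about single and album versions. `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates` keeps only the first occurrence by default.

```python
import pandas as pd

# hypothetical rows: the same song listed under two different track IDs
toy = pd.DataFrame({
    'artist_name': ['A', 'A', 'B'],
    'track_id': ['id1', 'id2', 'id3'],
    'track_name': ['x', 'x', 'y'],
    'popularity0920': [90.0, 75.0, 80.0],
})

# flag every row that shares artist and track name with another row
dupes = toy[toy.duplicated(subset=['artist_name', 'track_name'], keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence of each group by default
deduped = toy.drop_duplicates(subset=['artist_name', 'track_name'])
print('Shape after dropping: ', deduped.shape)
```

Because the first occurrence wins, the row order before the drop decides which popularity values survive.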

1. Top 50 Most Popular Tracks

In [3]:

# sum individual popularity scores
df['popularity'] = df[['popularity0920', 'popularity0830', 'popularity0807']].sum(axis=1)

# also calculate the mean popularity score
df['popularity_mean'] = df[['popularity0920', 'popularity0830', 'popularity0807']].mean(axis=1)

# create a new dataframe df_top with the 100 most popular tracks, ordered by summed popularity
df_top = df.sort_values('popularity', ascending=False).head(100)

# show the first 50 results
df_top[['artist_name', 'track_name', 'popularity', 'popularity_mean']].head(50)

Out[3]:

	artist_name	track_name	popularity	popularity_mean
3	Drake	In My Feelings	300.0	100.000000
4	XXXTENTACION	SAD!	288.0	96.000000
12	Cardi B	I Like It	288.0	96.000000
6	Tyga	Taste (feat. Offset)	287.0	95.666667
10	Post Malone	Better Now	287.0	95.666667
80	Clean Bandit	Solo (feat. Demi Lovato)	287.0	95.666667
2	Juice WRLD	Lucid Dreams	286.0	95.333333
72	Calvin Harris	One Kiss (with Dua Lipa)	285.0	95.000000
1	6ix9ine	FEFE (feat. Nicki Minaj & Murda Beatz)	285.0	95.000000
9	benny blanco	Eastside (with Halsey & Khalid)	284.0	94.666667
33	Tiësto	Jackie Chan	284.0	94.666667
20	Maroon 5	Girls Like You (feat. Cardi B)	284.0	94.666667
55	Jonas Blue	Rise	283.0	94.333333
14	DJ Khaled	No Brainer	283.0	94.333333
40	Nio Garcia	Te Boté - Remix	282.0	94.000000
7	XXXTENTACION	Moonlight	281.0	93.666667
58	Dynoro	In My Mind	280.0	93.333333
144	Becky G	Sin Pijama	278.0	92.666667
0	Travis Scott	SICKO MODE	276.0	92.000000
87	Martin Garrix	Ocean (feat. Khalid)	275.0	91.666667
21	XXXTENTACION	changes	275.0	91.666667
163	Ozuna	Vaina Loca	274.0	91.333333
8	Drake	Nonstop	274.0	91.333333
45	5 Seconds of Summer	Youngblood	274.0	91.333333
18	Post Malone	Psycho (feat. Ty Dolla $ign)	273.0	91.000000
25	Post Malone	rockstar (feat. 21 Savage)	273.0	91.000000
31	A$AP Rocky	Praise The Lord (Da Shine)	271.0	90.333333
240	Nicky Jam	X	271.0	90.333333
134	Marshmello	FRIENDS	270.0	90.000000
11	Lil Baby	Yes Indeed	270.0	90.000000
23	Drake	God's Plan	269.0	89.666667
184	Shakira	Clandestino	269.0	89.666667
13	Travis Scott	STARGAZING	269.0	89.666667
397	David Guetta	Flames	268.0	89.333333
263	Reik	Me Niego	268.0	89.333333
177	George Ezra	Shotgun	268.0	89.333333
66	Drake	Don’t Matter To Me (with Michael Jackson)	268.0	89.333333
37	Zedd	Happy Now	268.0	89.333333
89	Zedd	The Middle	267.0	89.000000
35	Dean Lewis	Be Alright	267.0	89.000000
107	Camila Cabello	Havana	267.0	89.000000
43	Billie Eilish	lovely (with Khalid)	266.0	88.666667
28	BlocBoy JB	Look Alive (feat. Drake)	266.0	88.666667
170	Nicky Jam	X - Remix	266.0	88.666667
393	Sofia Reyes	1, 2, 3 (feat. Jason Derulo & De La Ghetto)	266.0	88.666667
130	Selena Gomez	Back To You - From 13 Reasons Why – Season 2 S...	266.0	88.666667
334	Zion & Lennox	La Player (Bandolera)	265.0	88.333333
42	Khalid	Love Lies (with Normani)	264.0	88.000000
30	Imagine Dragons	Natural	264.0	88.000000
44	Panic! At The Disco	High Hopes	263.0	87.666667
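One detail of the ranking is worth spelling out: `sum(axis=1)` and `mean(axis=1)` both skip NaN by default, so a track that is missing from one snapshot gets a lower sum but an unaffected mean. The sketch below illustrates this with two made-up tracks (the names and values are hypothetical). The `popularity_strict` column is an assumption of mine, not part of the original analysis; it uses the `min_count` parameter to return NaN unless all three snapshots are present.

```python
import numpy as np
import pandas as pd

cols = ['popularity0807', 'popularity0830', 'popularity0920']

# hypothetical tracks: track_b is missing from the first snapshot
toy = pd.DataFrame(
    [[90.0, 90.0, 90.0],
     [np.nan, 95.0, 95.0]],
    columns=cols, index=['track_a', 'track_b'])

# the sum skips NaN, so a missing snapshot effectively counts as 0
toy['popularity'] = toy[cols].sum(axis=1)

# the mean skips NaN as well, so it only averages the available snapshots
toy['popularity_mean'] = toy[cols].mean(axis=1)

# stricter variant: NaN unless all three snapshots are present
toy['popularity_strict'] = toy[cols].sum(axis=1, min_count=3)

print(toy)
```

Here track_b has the higher mean (95 versus 90) but the lower sum (190 versus 270), so the two columns can rank tracks differently.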

2. Top Artists by Popularity

Note: the Spotify API also offers a separate popularity score at the artist level. That score is not used here.

Instead, I have used only the popularity scores of the artists' individual tracks.

In [4]:

# show top 20 artists by number of tracks in top 100
df_top[['artist_name','track_name']].groupby('artist_name').count().sort_values('track_name', ascending=False).head(20)

Out[4]:

	track_name
artist_name
Drake	5
XXXTENTACION	5
Travis Scott	5
Post Malone	5
Juice WRLD	3
Khalid	2
David Guetta	2
Daddy Yankee	2
Ozuna	2
Nicky Jam	2
Childish Gambino	2
Selena Gomez	2
Billie Eilish	2
Tyga	2
Zedd	2
Marshmello	1
Maroon 5	1
Martin Garrix	1
Migos	1
Maluma	1

In [5]:

# show top 20 artists by total popularity of their tracks in top 100
df_top[['artist_name','popularity']].groupby('artist_name').sum().sort_values('popularity', ascending=False).head(20)

Out[5]:

	popularity
artist_name
Drake	1364.0
XXXTENTACION	1362.0
Post Malone	1343.0
Travis Scott	1303.0
Juice WRLD	797.0
Tyga	544.0
Nicky Jam	537.0
Ozuna	535.0
Zedd	535.0
David Guetta	525.0
Khalid	524.0
Selena Gomez	520.0
Billie Eilish	518.0
Daddy Yankee	515.0
Childish Gambino	513.0
Cardi B	288.0
Clean Bandit	287.0
6ix9ine	285.0
Calvin Harris	285.0
benny blanco	284.0
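The two tables above are closely related: the four artists with five tracks each also lead the total-popularity ranking, since a summed score grows with the number of tracks. A sketch of how both views, plus a per-track mean, could be produced in one pass with named aggregation is shown below. The frame `toy_top` is a hypothetical stand-in for `df_top`, and the output column names are my own choice.

```python
import pandas as pd

# hypothetical stand-in for df_top
toy_top = pd.DataFrame({
    'artist_name': ['A', 'A', 'B', 'C'],
    'track_name': ['x', 'y', 'z', 'w'],
    'popularity': [280.0, 270.0, 290.0, 260.0],
})

# one row per artist: track count, total and mean popularity
artist_summary = (
    toy_top.groupby('artist_name')
    .agg(n_tracks=('track_name', 'count'),
         total_popularity=('popularity', 'sum'),
         mean_popularity=('popularity', 'mean'))
    .sort_values('total_popularity', ascending=False)
)

print(artist_summary)
```

In this toy data artist A leads by total popularity thanks to two tracks, while artist B has the highest mean with a single track.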

3. Visualizing Popularity

For this visualization I borrowed the code from another project of mine - Twitter 10k (plot number 5).

In [6]:

# create a new transposed dataframe where the track names are the columns and individual popularities the rows
df_top10_pop = df_top[['track_name','popularity0807','popularity0830','popularity0920']].set_index('track_name').head(10).T

# set the figure size
plt.figure(figsize=(12,18))
 
# create a color palette
palette = plt.get_cmap('Set1')

# multiple line plot of the top10 track popularities
num=0
for track in df_top10_pop.columns:
    num+=1
 
    # find the right spot on the plot
    plt.subplot(10,1, num)
    
    # plot the individual popularities
    df_top10_pop.loc[['popularity0807', 'popularity0830', 'popularity0920'],track].plot(marker='', color=palette(num), linewidth=2.5)
    
    # same limits for every subplot
    plt.ylim(90,100)
    
    # place one tick per snapshot date
    # (pandas plots a string index at positions 0, 1, 2)
    mylabels = ['7th August', '30th August', '20th September']
    plt.xticks(range(len(mylabels)), mylabels)

    # show tick labels only on the bottom subplot
    if num < 10:
        plt.tick_params(labelbottom=False)
        
    # add title
    plt.title(track, loc='left', fontsize=10, fontweight=0, color=palette(num))
    
# add general title
plt.suptitle("Popularity of Top 10 Tracks during Summer 2018", fontsize=13, fontweight=0, color='black', style='italic');
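As a possible alternative to the `plt.subplot` loop, the same stacked layout can be sketched with `plt.subplots` and shared axes, which takes care of hiding the inner tick labels and keeping the y-limits in sync. The data below is made up (three hypothetical tracks rather than the real top 10), so this illustrates the layout only and does not reproduce the figure above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical popularity values for three made-up tracks
toy_pop = pd.DataFrame(
    {'track_a': [98, 99, 100], 'track_b': [95, 96, 94], 'track_c': [92, 93, 95]},
    index=['7th August', '30th August', '20th September'])

# one row per track; shared axes hide the inner tick labels automatically
fig, axes = plt.subplots(nrows=len(toy_pop.columns), ncols=1,
                         sharex=True, sharey=True, figsize=(12, 6))
palette = plt.get_cmap('Set1')

for i, (ax, track) in enumerate(zip(axes, toy_pop.columns)):
    # plot against integer positions so the tick locations are explicit
    ax.plot(range(len(toy_pop)), toy_pop[track], color=palette(i), linewidth=2.5)
    ax.set_title(track, loc='left', fontsize=10, color=palette(i))

# label the snapshot dates once, on the bottom subplot
axes[-1].set_xticks(range(len(toy_pop)))
axes[-1].set_xticklabels(toy_pop.index)

# same limits for every subplot (propagated through sharey)
axes[0].set_ylim(90, 100)

fig.suptitle('Popularity of three made-up tracks', fontsize=13, style='italic')
```

Using the axes objects directly avoids depending on the default tick locations, at the cost of moving away from the `plt.*` state-machine style used in the cell above.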