After retrieving some data from the Spotify API (for more info about that, check out this notebook), it's time to get some insights. In this notebook, I will use data collected during August and September 2018 to identify the most popular tracks and artists on Spotify using the 'popularity' feature.
From the official Spotify docs:
"The popularity of the track. The value will be between 0, for least popular, and 100 for most popular.
The popularity of a track is a value between 0 and 100, with 100 being the most popular. Popularity is based mainly on the total number of playbacks. Duplicate tracks, such as both in a single and in an album, are popularity rated differently. Note: This value is not updated in real-time and may therefore lag behind in actual popularity."
The goal is to use the previously retrieved data to gain insights from the popularity feature, such as the most popular tracks and the most popular artists, by analyzing and visualizing the data with the Python libraries pandas, NumPy and Matplotlib.
# import libraries
import glob, os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# get all csv files into one variable
path = 'Datasets/Summer2018'
all_files = glob.glob(os.path.join(path, "*.csv"))
# create lists of columns to be used when reading/merging the csv's
columns = ['artist_name','track_id', 'track_name', 'popularity']
merge_columns = ['artist_name','track_id', 'track_name']
# create dataframes by reading the csv's in all_files
df_from_each_file = (pd.read_csv(f, usecols=columns) for f in all_files)
# create empty dataframe with the defined column structure
df = pd.DataFrame(columns=columns)
# loop over dataframes and merge into one dataframe
# outer join in order to keep the popularity column from each file
for df_, files in zip(df_from_each_file, all_files): # all_files are here to provide the column suffix (0920,0830 etc)
    df = df.merge(df_, how='outer', on=merge_columns, suffixes=('',str(files)[-8:-4]))
print('Shape: ', df.shape)
df.head()
Shape: (13257, 7)
| | popularity | artist_name | popularity0920 | track_id | track_name | popularity0830 | popularity0807 |
---|---|---|---|---|---|---|---|
0 | NaN | Travis Scott | 96.0 | 2xLMifQCjDGFmkHkpNLD9h | SICKO MODE | 94.0 | 86.0 |
1 | NaN | 6ix9ine | 96.0 | 2E124GmJRnBJuXbTb4cPUB | FEFE (feat. Nicki Minaj & Murda Beatz) | 95.0 | 94.0 |
2 | NaN | Juice WRLD | 96.0 | 0s3nnoMeVWz3989MkNQiRf | Lucid Dreams | 95.0 | 95.0 |
3 | NaN | Drake | 100.0 | 2G7V7zsVDxg1yRsu7Ew9RJ | In My Feelings | 100.0 | 100.0 |
4 | NaN | XXXTENTACION | 95.0 | 3ee8Jmje8o58CHK66QrVC2 | SAD! | 95.0 | 98.0 |
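The merge loop above relies on the `suffixes` argument to tell the per-file popularity columns apart. As a minimal sketch of the same idea on toy data, the snippet below renames each `popularity` column before merging. This is an alternative to the suffix approach, not the code used in this notebook, and the file names and values are made up for illustration.

```python
import pandas as pd

merge_columns = ['artist_name', 'track_id', 'track_name']

# hypothetical snapshots keyed by (made-up) file name; all values are invented
snapshots = {
    'tracks_0807.csv': pd.DataFrame({
        'artist_name': ['Artist A', 'Artist B'],
        'track_id': ['id1', 'id2'],
        'track_name': ['Song X', 'Song Y'],
        'popularity': [90, 80],
    }),
    'tracks_0830.csv': pd.DataFrame({
        'artist_name': ['Artist A', 'Artist C'],
        'track_id': ['id1', 'id3'],
        'track_name': ['Song X', 'Song Z'],
        'popularity': [92, 70],
    }),
}

merged = None
for file_name, snapshot in snapshots.items():
    # same slice as in the loop above: the 4 characters before '.csv'
    suffix = file_name[-8:-4]
    snapshot = snapshot.rename(columns={'popularity': 'popularity' + suffix})
    # outer join keeps tracks that are missing from one of the snapshots
    merged = snapshot if merged is None else merged.merge(
        snapshot, how='outer', on=merge_columns)

print(merged)
```

Tracks that appear in only one snapshot end up with NaN in the other popularity column, which is also why the popularity values in the table above are floats.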
Since I merged the three files on artist name, track ID and track name, there shouldn't be many duplicates.
However, the same track can still appear under more than one track ID, so it is worth doing a quick drop_duplicates on artist and track name here.
# drop duplicate tracks
df.drop_duplicates(subset=['artist_name','track_name'], inplace=True)
print('Shape after dropping: ', df.shape)
Shape after dropping: (12933, 7)
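To make explicit what the call above does: `drop_duplicates` with `subset` keeps the first row for each artist/track name pair and discards the rest, even when the rows differ in `track_id`. A small sketch with invented rows:

```python
import pandas as pd

# invented rows: the same song listed under two different track IDs
toy = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist A', 'Artist B'],
    'track_id': ['id1', 'id1b', 'id2'],
    'track_name': ['Song X', 'Song X', 'Song Y'],
    'popularity0920': [90.0, 55.0, 80.0],
})

# keep='first' is the default: the first occurrence of each pair survives
deduped = toy.drop_duplicates(subset=['artist_name', 'track_name'])
print(deduped)
```

Because only the first occurrence survives, the row order before dropping determines which popularity values are kept.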
# sum individual popularity scores
df['popularity'] = df[['popularity0920', 'popularity0830', 'popularity0807']].sum(axis=1)
# also calculate the mean popularity score
df['popularity_mean'] = df[['popularity0920', 'popularity0830', 'popularity0807']].mean(axis=1)
# create new dataframe df_top with the 100 most popular tracks, ordered by total popularity
df_top = df.sort_values('popularity', ascending=False).head(100)
# show the first 50 results
df_top[['artist_name', 'track_name', 'popularity', 'popularity_mean']].head(50)
| | artist_name | track_name | popularity | popularity_mean |
---|---|---|---|---|
3 | Drake | In My Feelings | 300.0 | 100.000000 |
4 | XXXTENTACION | SAD! | 288.0 | 96.000000 |
12 | Cardi B | I Like It | 288.0 | 96.000000 |
6 | Tyga | Taste (feat. Offset) | 287.0 | 95.666667 |
10 | Post Malone | Better Now | 287.0 | 95.666667 |
80 | Clean Bandit | Solo (feat. Demi Lovato) | 287.0 | 95.666667 |
2 | Juice WRLD | Lucid Dreams | 286.0 | 95.333333 |
72 | Calvin Harris | One Kiss (with Dua Lipa) | 285.0 | 95.000000 |
1 | 6ix9ine | FEFE (feat. Nicki Minaj & Murda Beatz) | 285.0 | 95.000000 |
9 | benny blanco | Eastside (with Halsey & Khalid) | 284.0 | 94.666667 |
33 | Tiësto | Jackie Chan | 284.0 | 94.666667 |
20 | Maroon 5 | Girls Like You (feat. Cardi B) | 284.0 | 94.666667 |
55 | Jonas Blue | Rise | 283.0 | 94.333333 |
14 | DJ Khaled | No Brainer | 283.0 | 94.333333 |
40 | Nio Garcia | Te Boté - Remix | 282.0 | 94.000000 |
7 | XXXTENTACION | Moonlight | 281.0 | 93.666667 |
58 | Dynoro | In My Mind | 280.0 | 93.333333 |
144 | Becky G | Sin Pijama | 278.0 | 92.666667 |
0 | Travis Scott | SICKO MODE | 276.0 | 92.000000 |
87 | Martin Garrix | Ocean (feat. Khalid) | 275.0 | 91.666667 |
21 | XXXTENTACION | changes | 275.0 | 91.666667 |
163 | Ozuna | Vaina Loca | 274.0 | 91.333333 |
8 | Drake | Nonstop | 274.0 | 91.333333 |
45 | 5 Seconds of Summer | Youngblood | 274.0 | 91.333333 |
18 | Post Malone | Psycho (feat. Ty Dolla $ign) | 273.0 | 91.000000 |
25 | Post Malone | rockstar (feat. 21 Savage) | 273.0 | 91.000000 |
31 | A$AP Rocky | Praise The Lord (Da Shine) | 271.0 | 90.333333 |
240 | Nicky Jam | X | 271.0 | 90.333333 |
134 | Marshmello | FRIENDS | 270.0 | 90.000000 |
11 | Lil Baby | Yes Indeed | 270.0 | 90.000000 |
23 | Drake | God's Plan | 269.0 | 89.666667 |
184 | Shakira | Clandestino | 269.0 | 89.666667 |
13 | Travis Scott | STARGAZING | 269.0 | 89.666667 |
397 | David Guetta | Flames | 268.0 | 89.333333 |
263 | Reik | Me Niego | 268.0 | 89.333333 |
177 | George Ezra | Shotgun | 268.0 | 89.333333 |
66 | Drake | Don’t Matter To Me (with Michael Jackson) | 268.0 | 89.333333 |
37 | Zedd | Happy Now | 268.0 | 89.333333 |
89 | Zedd | The Middle | 267.0 | 89.000000 |
35 | Dean Lewis | Be Alright | 267.0 | 89.000000 |
107 | Camila Cabello | Havana | 267.0 | 89.000000 |
43 | Billie Eilish | lovely (with Khalid) | 266.0 | 88.666667 |
28 | BlocBoy JB | Look Alive (feat. Drake) | 266.0 | 88.666667 |
170 | Nicky Jam | X - Remix | 266.0 | 88.666667 |
393 | Sofia Reyes | 1, 2, 3 (feat. Jason Derulo & De La Ghetto) | 266.0 | 88.666667 |
130 | Selena Gomez | Back To You - From 13 Reasons Why – Season 2 S... | 266.0 | 88.666667 |
334 | Zion & Lennox | La Player (Bandolera) | 265.0 | 88.333333 |
42 | Khalid | Love Lies (with Normani) | 264.0 | 88.000000 |
30 | Imagine Dragons | Natural | 264.0 | 88.000000 |
44 | Panic! At The Disco | High Hopes | 263.0 | 87.666667 |
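One detail worth keeping in mind when reading this ranking: `sum(axis=1)` and `mean(axis=1)` both skip missing values by default. A track that is absent from one of the three files therefore gets a lower total, while its mean is taken only over the files it does appear in. The toy values below are invented to illustrate this.

```python
import numpy as np
import pandas as pd

score_columns = ['popularity0920', 'popularity0830', 'popularity0807']

# invented scores; the second track is missing from one snapshot
toy = pd.DataFrame({
    'track_name': ['Song X', 'Song Y'],
    'popularity0920': [90.0, 96.0],
    'popularity0830': [90.0, 96.0],
    'popularity0807': [90.0, np.nan],
})

toy['popularity'] = toy[score_columns].sum(axis=1)        # NaN is skipped
toy['popularity_mean'] = toy[score_columns].mean(axis=1)  # NaN is skipped
# number of snapshots each track actually appears in
toy['n_snapshots'] = toy[score_columns].notna().sum(axis=1)

print(toy[['track_name', 'popularity', 'popularity_mean', 'n_snapshots']])
```

Sorting by the sum, as done above, favours tracks that are present in all three files, whereas sorting by the mean would not.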
Note: the Spotify API also offers a separate popularity score at the artist level. That score is not used here.
Instead, I have used only the popularity scores of the artists' individual tracks.
# show top 20 artists by number of tracks in top 100
df_top[['artist_name','track_name']].groupby('artist_name').count().sort_values('track_name', ascending=False).head(20)
artist_name | track_name |
---|---|
Drake | 5 |
XXXTENTACION | 5 |
Travis Scott | 5 |
Post Malone | 5 |
Juice WRLD | 3 |
Khalid | 2 |
David Guetta | 2 |
Daddy Yankee | 2 |
Ozuna | 2 |
Nicky Jam | 2 |
Childish Gambino | 2 |
Selena Gomez | 2 |
Billie Eilish | 2 |
Tyga | 2 |
Zedd | 2 |
Marshmello | 1 |
Maroon 5 | 1 |
Martin Garrix | 1 |
Migos | 1 |
Maluma | 1 |
# show top 20 artists by total popularity of their tracks in top 100
df_top[['artist_name','popularity']].groupby('artist_name').sum().sort_values('popularity', ascending=False).head(20)
artist_name | popularity |
---|---|
Drake | 1364.0 |
XXXTENTACION | 1362.0 |
Post Malone | 1343.0 |
Travis Scott | 1303.0 |
Juice WRLD | 797.0 |
Tyga | 544.0 |
Nicky Jam | 537.0 |
Ozuna | 535.0 |
Zedd | 535.0 |
David Guetta | 525.0 |
Khalid | 524.0 |
Selena Gomez | 520.0 |
Billie Eilish | 518.0 |
Daddy Yankee | 515.0 |
Childish Gambino | 513.0 |
Cardi B | 288.0 |
Clean Bandit | 287.0 |
6ix9ine | 285.0 |
Calvin Harris | 285.0 |
benny blanco | 284.0 |
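The two rankings above can also be produced in a single pass with named aggregation. This is a sketch on invented data, not a replacement for the cells above. The output column names `tracks_in_top` and `total_popularity` are my own choice for the example.

```python
import pandas as pd

# invented stand-in for df_top
toy_top = pd.DataFrame({
    'artist_name': ['Artist A', 'Artist A', 'Artist B'],
    'track_name': ['Song X', 'Song Y', 'Song Z'],
    'popularity': [280.0, 270.0, 290.0],
})

per_artist = (
    toy_top.groupby('artist_name')
    .agg(tracks_in_top=('track_name', 'count'),
         total_popularity=('popularity', 'sum'))
    .sort_values('total_popularity', ascending=False)
)
print(per_artist)
```

Having both numbers side by side makes it easier to see that the total is driven mostly by how many tracks an artist has in the top 100.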
For this visualization I borrowed the code from another project of mine - Twitter 10k (plot number 5).
# create a new transposed dataframe where the track names are the columns and individual popularities the rows
df_top10_pop = df_top[['track_name','popularity0807','popularity0830','popularity0920']].set_index('track_name').head(10).T
# set the figure size
plt.figure(figsize=(12,18))
# create a color palette
palette = plt.get_cmap('Set1')
# multiple line plot of the top10 track popularities
num = 0
for track in df_top10_pop.columns:
    num += 1
    # find the right spot on the plot
    plt.subplot(10, 1, num)
    # plot the individual popularities
    df_top10_pop.loc[['popularity0807', 'popularity0830', 'popularity0920'], track].plot(marker='', color=palette(num), linewidth=2.5)
    # same limits for every subplot
    plt.ylim(90, 100)
    # the three snapshots sit at x positions 0, 1 and 2, so label exactly those
    plt.xticks([0, 1, 2], ['7th August', '30th August', '20th September'])
    # show tick labels only on the last subplot
    if num in range(10):
        plt.tick_params(labelbottom=False)
    # add title
    plt.title(track, loc='left', fontsize=10, fontweight=0, color=palette(num))
# add general title
plt.suptitle("Popularity of Top 10 Tracks during Summer 2018", fontsize=13, fontweight=0, color='black', style='italic');
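As an alternative to hiding tick labels by hand, `plt.subplots` with `sharex=True` only labels the bottom panel. The sketch below uses invented scores for three tracks and selects a non-interactive backend so that it also runs outside a notebook. It is not the code that produced the figure above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

dates = ['7th August', '30th August', '20th September']
# invented popularity scores per track, one value per snapshot
scores = {
    'Song X': [100, 100, 100],
    'Song Y': [98, 95, 95],
    'Song Z': [94, 95, 96],
}

palette = plt.get_cmap('Set1')
fig, axes = plt.subplots(len(scores), 1, sharex=True, sharey=True,
                         figsize=(12, 6))
for i, (ax, (track, values)) in enumerate(zip(axes, scores.items())):
    ax.plot(range(len(dates)), values, color=palette(i), linewidth=2.5)
    ax.set_title(track, loc='left', fontsize=10, color=palette(i))

# shared axes: setting limits and ticks once applies to every panel
axes[-1].set_ylim(90, 100)
axes[-1].set_xticks(range(len(dates)))
axes[-1].set_xticklabels(dates)
fig.suptitle('Popularity of Top Tracks (toy data)', fontsize=13, style='italic')
```

With shared axes there is no need to track a counter just to decide which subplot gets the date labels.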